Temporal Analysis of Radical Dark Web Forum Users
Conference Paper · August 2016
DOI: 10.1109/ASONAM.2016.7752341
2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
Temporal Analysis of Radical Dark Web
Forum Users
Andrew J. Park∗, Brian Beck, Darrick Fletche, Patrick Lam, and Herbert H. Tsang†
∗ Department of Computing Science, Thompson Rivers University, Kamloops, British Columbia, Canada
† Applied Research Lab, Trinity Western University, Langley, British Columbia, Canada
Email: apark@tru.ca∗ and herbert.tsang@twu.ca†
Abstract—Extremist groups have turned to the Internet and
social media sites as a means of sharing information amongst
one another. This research study analyzes forum posts and finds
people who show radical tendencies through the use of natural
language processing and sentiment analysis. The forum data
being used are from six Islamic forums on the Dark Web which are
made available for security research. This research project uses
a POS tagger to isolate keywords and nouns that can be utilized
with the sentiment analysis program. Then the sentiment analysis
program determines the polarity of the post. The post is scored
as either positive or negative. These scores are then divided into
monthly radical scores for each user. Once these time clusters
are mapped, the change in opinions of the users over time may
be interpreted as rising or falling levels of radicalism. Each user
is then compared on a timeline to other radical users and events
to determine possible connections or relationships. The ability
to analyze a forum for an overall change in attitude can be an
indicator of unrest and possible radical actions or terrorism.
I. INTRODUCTION
Communication on a global scale has become commonplace
and necessary to keep pace with the events of the world. This
is applicable to nearly every member of society: from the
average citizen conversing with relatives in another country, to
government agencies coordinating relief efforts, to extremist or
terrorist groups devising their next attack [1]. Radical groups
such as ISIS (Islamic State of Iraq and Syria) require a means
of spreading their propaganda and messages in the hopes of
recruiting new members. The Internet is an all-encompassing
network that allows such organizations to plan and carry out
their oftentimes violent goals.
Recently, these extremist groups utilized mainstream social
media for the purposes of communication and recruitment.
This included sites such as Facebook and Twitter [2]. However,
these companies have altered their terms and conditions in
order to filter out and delete posts that contained violent and
extremist content. In light of these changes, these groups have
moved communications to a lesser-known part of the Internet
known as the Deep Web. The Deep Web contains websites
that are not accessible or linked from common search engines
like Google. To access the Deep Web, users require specialized
software to reach the hidden network of routers. By design,
this network allows for relative anonymity because tracing
back to a particular user would require following
a series of different routers around the globe. This
varying path through routers makes tracking very difficult. A
subset of the Deep Web is the Dark Web. This section is home
to a black market for many illicit goods, the most common
being illegal drugs. It is also an area where users can post
anonymously through certain forums.
IEEE/ACM ASONAM 2016, August 18-21, 2016, San Francisco, CA, USA
978-1-5090-2846-7/16/$31.00 © 2016 IEEE
It is in these forums that radical users and groups can speak
freely of their ideologies and opinions. Our research examines
whether it is possible to analyze the opinion over time of these
postings to identify users that are escalating in extremism and
to correlate these escalations with real world terrorist events
[3], [4], [5].
II. BACKGROUND
The Dark Web archive is a collection of critical jihadist and
extremist web forums [6]. This archive has been accessible
for researchers and practitioners to study and analyze various
social and computational problems. A number of studies using
the Dark Web archive have tried to detect the indicators of
extremism using techniques like data mining, textual analysis,
and sentiment analysis [7], [8], [9], [10]. In particular, Scrivens
et al. developed a method of identifying the most radical
users across four Dark Web discussion forums using sentiment
analysis techniques and their Sentiment-based Identification
of Radical Authors (SIRA) algorithm [10]. However, their
method simply looks at the posting lifetime of each user in
the forums. This paper presents our preliminary research study
that has furthered their work by investigating the change in
opinion over time and possible temporal correlations between
the forum users’ radicalism and real world terrorist events.
III. METHOD
Our analysis consists of five steps: a) Data Preparation, b)
Part of Speech Tagging, c) Sentiment Analysis, d) Forum Temporal Analysis, and e) User Temporal Analysis. The following
sections document the detailed procedure and analysis for each
step.
A. Data Preparation
This research is based on forum data from 6 different
forums on the Dark Web. The forums used are Gawaher,
Islamic Network, Islamic Awakening, Turn to Islam, Myiwc,
and Ummah. The forum posts range from the year 2000
to 2012. Together, the forums have 48,324 users, and
our analysis covers a total of 2,547,463 posts. This time frame
gives us a large amount of data
to analyze, as well as the ability to correlate the forum data
with terrorist events that have happened. The forum posts,
provided as text files, were transferred into an
SQLite database for easier access and manipulation.
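The loading step can be sketched as follows. This is a minimal illustration only: the paper does not describe the database schema or the layout of the source text files, so the table name `posts`, its columns, and the helper names below are all assumptions.

```python
import sqlite3


def create_post_store(conn):
    """Create a hypothetical `posts` table (schema is an assumption)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS posts (
            post_id   INTEGER PRIMARY KEY,
            forum     TEXT    NOT NULL,
            user_id   INTEGER NOT NULL,
            posted_at TEXT    NOT NULL,  -- ISO 8601 date, e.g. 2005-07-07
            body      TEXT    NOT NULL,
            sentiment INTEGER            -- filled in by a later step
        )
        """
    )
    # An index on (forum, posted_at) keeps per-forum, per-month queries cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_posts_forum_date "
        "ON posts (forum, posted_at)"
    )


def insert_posts(conn, records):
    """Insert (forum, user_id, posted_at, body) tuples in one transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO posts (forum, user_id, posted_at, body) "
            "VALUES (?, ?, ?, ?)",
            records,
        )


def count_posts(conn):
    """Return the number of stored posts."""
    return conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
```

Parsing the text files into `(forum, user_id, posted_at, body)` tuples is left out because it depends on the file layout, which is not specified here.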
B. POS Tagging
Once the forum data were in the database, the forum posts
were analyzed to find sentiment scores. In this step, we
examined the forum posts and broke them down into
individual words and phrases for sentiment analysis in
the next step. We employed OpenNLP for this process. OpenNLP (Open Natural Language Processor,
http://opennlp.sourceforge.net) is a free natural language processor from the Apache suite of open source software. This
tool offers many different language processing options, but we
mainly used the tokenizer, which splits plain text into an array
of words and tokens, and the Parts of Speech (POS) tagger,
which tags each word or token with its part of speech,
e.g., noun, verb, or adjective.
The forum posts were first run through the OpenNLP
tokenizer to create an array with each element containing a
word, or a token such as punctuation. The array was then
used by the POS Tagger to find all of the nouns in each post,
and add them to a count to keep track of the frequency of each
noun. Once all the posts had been processed we then created
a list of the top 100 most frequently used nouns. These 100
nouns gave us a picture of the topics being discussed, which
matched topics that might have attracted radical users.
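The noun-counting stage can be illustrated with a short Python sketch. It does not call OpenNLP; it assumes the posts have already been tokenized and tagged upstream, and that the tagger emits Penn Treebank-style noun tags (`NN`, `NNS`, `NNP`, `NNPS`). Lowercasing the nouns before counting and the function names are also assumptions made for illustration.

```python
from collections import Counter

# Penn Treebank noun tags (assumed tag set of the upstream tagger).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def count_nouns(tagged_posts):
    """Count noun frequencies over posts given as lists of (token, tag) pairs."""
    counts = Counter()
    for tagged_post in tagged_posts:
        for token, tag in tagged_post:
            if tag in NOUN_TAGS:
                counts[token.lower()] += 1  # case folding is an assumption
    return counts


def top_nouns(tagged_posts, n=100):
    """Return the n most frequent nouns, most frequent first."""
    return [noun for noun, _ in count_nouns(tagged_posts).most_common(n)]
```

Calling `top_nouns(tagged_posts, n=100)` would then yield a keyword list of the kind used in the next step.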
C. Sentiment Analysis (Opinion Mining)
After the forum posts were broken down, we performed the
sentiment analysis on these posts. SentiStrength is a sentiment
analysis tool [11]. It takes plain text as input, and analyzes it
for opinion scores that range from -5 (very negative) to 5 (very
positive), with 0 being neutral. SentiStrength can also
analyze text around keywords instead of the overall text,
so that opinion scores can be directed at certain topics. In
our research, every post was analyzed using SentiStrength
keyword analysis, with the 100 most frequent nouns
obtained in the previous step as the keywords to focus
the analysis around. This ensured that we were analyzing
sentiment about the topics relevant to our research and
leaving out opinions on unrelated topics, which helped to
create more accurate sentiment scores for our purposes.
This analysis tagged each post in the database with
a sentiment score ranging from -5 to 5.
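To make the idea of keyword-focused scoring concrete, the sketch below sums lexicon weights only for words that fall within a fixed window around a keyword, then clamps the result to the range -5 to 5. It is a toy stand-in and not SentiStrength's algorithm: the lexicon, the window size, and the function name are all hypothetical.

```python
# Toy sentiment lexicon (hypothetical weights, for illustration only).
TOY_LEXICON = {
    "good": 2, "great": 3, "love": 3,
    "bad": -2, "terrible": -4, "hate": -4,
}


def keyword_sentiment(tokens, keywords, lexicon=TOY_LEXICON, window=4):
    """Score only the words within `window` tokens of any keyword.

    Returns an integer clamped to [-5, 5]; 0 if no keyword occurs.
    """
    in_scope = set()
    for i, token in enumerate(tokens):
        if token.lower() in keywords:
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            in_scope.update(range(start, stop))
    # Each token position is counted once, even if windows overlap.
    total = sum(lexicon.get(tokens[j].lower(), 0) for j in in_scope)
    return max(-5, min(5, total))
```

Words outside every keyword window contribute nothing, which mirrors the goal of leaving out opinions on unrelated topics.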
D. Forum Temporal Analysis
The sentiment scores for posts from each forum were then
combined on a month-to-month basis, creating an average
monthly sentiment score per forum. This shows the average
sentiment of all users in each forum and how it changes
from one month to the next. This was used to track negative or
positive spikes and correlate them with the months when major
terrorist events occurred.
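The monthly aggregation can be expressed as a single SQL query. The sketch below assumes a hypothetical SQLite table `posts` with columns `forum`, `posted_at` (an ISO 8601 date string), and `sentiment`; none of these names come from the paper.

```python
import sqlite3


def monthly_forum_averages(conn):
    """Return (forum, 'YYYY-MM', average sentiment) rows, ordered by forum and month.

    Assumes a `posts` table with `forum`, `posted_at`, and `sentiment` columns.
    """
    return conn.execute(
        """
        SELECT forum,
               strftime('%Y-%m', posted_at) AS month,
               AVG(sentiment)               AS avg_sentiment
        FROM posts
        WHERE sentiment IS NOT NULL
        GROUP BY forum, month
        ORDER BY forum, month
        """
    ).fetchall()
```

Dropping `forum` from the `SELECT` and `GROUP BY` clauses would give one combined monthly average across all forums instead.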
In the data we examined for this paper (2000 to 2012),
three terrorist events were selected where the forum temporal
analysis showed drastic negative spikes in sentiment scores.
The first event was the bombings in London, England, on July
7, 2005. The second was in Rawalpindi, Pakistan, on December
27, 2007, where Benazir Bhutto was assassinated and 24 others
were killed by suicide bombs. The last event was in Mumbai,
India, on November 26-29, 2008, where coordinated shootings and
bombings rocked the city.
E. User Temporal Analysis
Using the three events chosen based on the forum temporal
analysis from the previous step, the analysis of individual users
was performed. The goal was to take a closer look at individual
sentiment scores and how they would change on a day-to-day
basis during the month of a major terrorist event.
Initially, top 20 lists were created for the most negative
and most positive users based on their average sentiment scores
over their forum posting lifetime. This was done to compare
our negative users with our positive users.
Any users with fewer than 500 posts were excluded
due to limited forum use. This provided us with a base
from which to start individual analysis around terrorist events. However,
we soon realized that only 4-5 users from the top 20 negative
users and 1-5 users from the top 20 positive users were active
during the month of each event. Thus, top 10 lists for positive
and negative users were drafted for each event based on the
average sentiment scores of users during that month only,
ensuring that all 10 users were active during the month of the
event.
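Both rankings can be sketched with one helper. The input format, the function name, and the parameters are assumptions; in particular, the text does not say whether a post-count threshold was applied to the per-event lists, so the `min_posts` value used for a single month is a free choice here.

```python
from collections import defaultdict
from statistics import mean


def rank_users(posts, min_posts=500, top_n=20, month=None):
    """Rank users by average sentiment score.

    posts: iterable of (user_id, posted_at, score), with posted_at as 'YYYY-MM-DD'.
    month: optional 'YYYY-MM' prefix; if given, only posts from that month count.
    Returns (most_negative, most_positive), each a list of (user_id, average).
    """
    scores = defaultdict(list)
    for user_id, posted_at, score in posts:
        if month is None or posted_at.startswith(month):
            scores[user_id].append(score)
    # Drop users with too few posts in the selected period.
    averages = {
        user: mean(values)
        for user, values in scores.items()
        if len(values) >= min_posts
    }
    ordered = sorted(averages.items(), key=lambda item: item[1])
    most_negative = ordered[:top_n]
    most_positive = ordered[::-1][:top_n]
    return most_negative, most_positive
```

A lifetime ranking would use the defaults, while a per-event ranking might be requested as `rank_users(posts, min_posts=1, top_n=10, month="2005-07")`, so that every listed user was active in the event month.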
IV. RESULTS & DISCUSSION
From the graph of the total monthly forum data (Fig. 1), we
can see that the total average sentiment score is negative, i.e., below
zero. This indicates that a majority of the postings
by users were of a negative nature centered around the most
used key nouns. Based on the negative spikes, we were able to
find certain real world terrorist attacks that correlate with them.
When most of the real world terrorist attacks happened,
deep negative spikes occurred at the same time. We believe
these spikes are caused by discussion of the terrorist events in
question. One event in particular, the assassination of Benazir
Bhutto, coincided very closely with a large dip in sentiment
across almost all forums during that time.
After looking at the forums as a whole, we began to
look at individual users of the forums to see if any patterns
could be extracted. Taking the 2005 London Bombing as an
example, we mapped the top 5 negative users who were active
during that time (Fig. 2). User 1558 seemed to be the only
actively posting user during the time of the bombing. However, no real
significance can be garnered from the user's postings.
The chart illustrating average scores during the attack (Fig. 3)
shows a downward trend towards the event itself.
Fig. 1. Combined average sentiment scores over lifetime of forums with selected terrorist events.
Fig. 2. Top negative users during 2005 London Bombing.
Fig. 3. Average sentiment scores during 2005 London Bombing.
V. CONCLUSION AND FUTURE DIRECTIONS
This paper presents a novel way of analyzing the Dark
Web forums through temporal analyses of forums and users
as well as sentiment analysis to discover trends and patterns
of sentiment scores over time correlated to real world terrorist
events. In this study, we have analyzed over two and a half
million posts of Dark Web forum data. We were able to find
correlations between overall monthly forum scores and real
world terrorist events. These trends could be caused by discussion generated from the events themselves. Negative change
in sentiment over time for a certain user can certainly indicate
their overall attitude towards certain issues, and may very
well be cause for concern for law enforcement or government
agencies. At this time, however, we cannot conclusively say
whether a user experiencing a negative sentiment spike is truly
radical or not. Close examination of the forum postings and
content would determine if they were truly a threat or not.
Future research could include an automated solution to analyze
Dark Web forum posts for extreme changes in sentiment.
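One possible starting point for such an automated solution is sketched below. It flags months whose average sentiment falls well below the series mean, measured in standard deviations. This is an illustrative assumption rather than the procedure used in this study; the z-score rule, the default threshold, and the function name are all hypothetical.

```python
from statistics import mean, pstdev


def negative_spikes(monthly_scores, threshold=2.0):
    """Return months whose average sentiment is an unusually low outlier.

    monthly_scores: list of (month, average_score) pairs.
    A month is flagged when its z-score is at or below -threshold.
    """
    values = [score for _, score in monthly_scores]
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat series has no spikes
    return [
        month
        for month, score in monthly_scores
        if (score - mu) / sigma <= -threshold
    ]
```

A flagged month would still need manual review of the underlying posts, since a drop in sentiment alone does not show that a user or forum is radical.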
ACKNOWLEDGMENT
The authors would like to acknowledge the generous provision of the data by the Data Infrastructure Building Blocks
(DIBBs) for ISI project funded by the National Science Foundation (ACI-1443019). We would also like to thank Thompson
Rivers University and Trinity Western University for their
support.
REFERENCES
[1] M. Ashcroft, A. Fisher, L. Kaati, E. Omer, and N. Prucha, “Detecting jihadist messages on twitter,” in EISIC 2015, September 7–9, Manchester,
UK. IEEE Computer Society, 2015.
[2] M. J. G. Lapayese, “Terrorism and its transition to cyberspace,” in Intelligence and Security Informatics Conference (EISIC), 2015 European.
IEEE, 2015, pp. 178–178.
[3] R. Colbaugh and K. Glass, “Estimating sentiment orientation in social
media for intelligence monitoring and analysis,” in 2010 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE,
2010, pp. 135–137.
[4] B. Liu, “Sentiment analysis and opinion mining,” Synthesis lectures on
human language technologies, vol. 5, no. 1, pp. 1–167, 2012.
[5] M. Thelwall, K. Buckley, and G. Paltoglou, “Sentiment strength detection for the social web,” Journal of the American Society for Information
Science and Technology, vol. 63, no. 1, pp. 163–173, 2012.
[6] Y. Zhang, S. Zeng, C.-N. Huang, L. Fan, X. Yu, Y. Dang, C. A. Larson,
D. Denning, N. Roberts, and H. Chen, “Developing a dark web collection
and infrastructure for computational and social sciences,” in 2010 IEEE
International Conference on Intelligence and Security Informatics (ISI).
IEEE, 2010, pp. 59–64.
[7] H. Chen, W. Chung, J. Qin, E. Reid, M. Sageman, and G. Weimann,
“Uncovering the dark web: A case study of jihad on the web,” Journal of
the American Society for Information Science and Technology, vol. 59,
no. 8, pp. 1347–1359, 2008.
[8] A. Abbasi, H. Chen, and A. Salem, “Sentiment analysis in multiple
languages: Feature selection for opinion classification in web forums,”
ACM Transactions on Information Systems (TOIS), vol. 26, no. 3, p. 12,
2008.
[9] H. Chen, Dark web: Exploring and data mining the dark side of the
web. Springer Science & Business Media, 2011, vol. 30.
[10] R. Scrivens, G. Davies, R. Frank, and J. Mei, “Sentiment-based identification of radical authors (sira),” 2015 IEEE International Conference
on Data Mining Workshop (ICDMW), 2015.
[11] M. Thelwall, K. Buckley, and G. Paltoglou, “Sentiment strength
detection for the social web,” Journal of the American Society for
Information Science and Technology, vol. 63, no. 1, pp. 163–173, Jan.
2012. [Online]. Available: http://dx.doi.org/10.1002/asi.21662