A Novel Approach for Recognized & Overcrowding of Terrorist Websites

International Journal of Engineering Trends and Technology - Volume 4, Issue 3 - 2013
Prof. G. A. Patil
Computer science & engineering department,
D. Y. Patil College of Engineering & Technology, Kolhapur, Maharashtra, India
Prof. K. B. Manwade
Computer science & engineering department,
Ashokrao Mane Group of Institution, Vathar tarf Vadgaon, Maharashtra, India
Mr. P. S. Landge
Computer science & engineering department,
D. Y. Patil College of Engineering & Technology, Kolhapur, Maharashtra, India
Abstract- Extremists mainly use the Internet to enhance their information operations surrounding propaganda, communication, and psychological warfare. Islamic militant organizations such as Al Qaeda, Hamas, and Hezbollah have been intensively using the Internet to disseminate their anti-Western and anti-Israel propaganda, provide training materials to their supporters, plan their operations, and raise funds by selling goods through their Web sites, yet our empirical understanding of their Internet usage remains limited. To address this research gap, we explore an integrated approach for identifying and collecting terrorist/extremist Web contents and for discovering hidden relationships among communities. It has been shown in the literature that content analysis gives more insight into technical sophistication and content richness, whereas link analysis focuses on Web interactivity. We introduced a quantitative Dark Web content analysis tool called the Dark Web Attribute System (DWAS) and tested it by applying it to the study of extremist organizations' Internet usage. The present work focuses on identifying and analyzing new web page attributes. It aims to compare different terrorist/extremist sites with genuine sites and accordingly prepare metrics which can be further used for identification of other sites of terrorist/extremist groups. It further attempts to visualize and analyze hidden domestic terrorism communities and intercommunity relationships among all web sites in our collection.

Keywords: Web content analysis, Web usage analysis, Web collection building, Social Network.

I. INTRODUCTION

Studying the sophistication of global extremist organizations' Web presence would allow us to better understand extremist organizations' technical sophistication, their access to information technology related resources, and their propaganda plans. However, the covert nature of the Dark Web and the lack of efficient automatic methodologies to monitor and analyze large amounts of Web content make such study difficult.

Terrorist organizations have generated thousands of Web sites that support psychological warfare, fundraising, recruitment, coordination, and distribution of propaganda materials. The level of technical sophistication of the Islamic terrorist organizations' Web sites has increased according to Katz, who monitors Islamic fundamentalist Internet activities. The rapid proliferation and increased sophistication of Web sites and online forums run by terrorist/extremist organizations are indications of the growing popularity of the Internet in terrorism campaigns. They also indicate that such organizations have attracted a vast pool of sympathizers, with some applying their IT expertise as contributions to the cause.
ISSN: 2231-5381, http://www.internationaljournalssrg.org

The Web has evolved towards multimedia-rich content delivery, end-user personal content generation, and community-based social interactions.
Due to the freedom and convenience of publishing in Weblogs [2][3], this form of media provides an ideal environment as a propaganda platform for extremist or terrorist groups to promote their ideologies. Criminals may also use the virtual environment to organize crimes such as money laundering and drug trafficking without being easily identified. As a result, it is important to understand the social network of bloggers in order to assess the risks that may threaten national security.

II. LITERATURE SURVEY

In social network analysis, the main task is usually to extract social networks from different communication resources [4]. The data used for building social networks is relational data, which can be obtained and transferred from different resources including the Web, email communication, Internet relay chats, telephone communications, and organization and business events. In recent years, there have been studies of how terrorists use the Web to facilitate their activities. The first step towards studying terrorists' tactical use of the Web is to build a high-quality Dark Web collection [5].

The rapid expansion of the Web is causing constant growth of information, leading to several problems such as increased difficulty of extracting potentially useful knowledge. Web content mining [6] confronts this problem by gathering explicit information from different Web sites for access and knowledge discovery. Web mining is concerned with the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services. Web content mining is an approach to extracting information from Web-based databases.

The DWAS is an effective tool to analyze the technical sophistication of terrorist/extremist groups' Internet usage and could contribute to an evidence-based understanding of the applications of Web technologies in the global terrorism phenomena. DWAS is used to visualize and analyze hidden domestic terrorism community and intercommunity relationships [7]. The DWAS helps in identifying the groups that are considered by authoritative sources to be terrorist/extremist groups. The sources include government agency reports, authoritative organization reports, and studies published by terrorism research centers. DWAS also identifies a set of seed terrorist group URLs from the authoritative sources and uses the terrorism keyword lexicon to query major search engines on the Web. After the seed URLs are identified, their out-links and in-links are automatically extracted using link-analysis programs. Once the terrorist/extremist Web sites are identified, a program is used to automatically download all their contents. The DWAS framework focuses on the attributes that could help us better understand the level of advancement and effectiveness of terrorists' Web usage, namely technical sophistication attributes, content richness attributes (an extension of the traditional media richness attributes), and Web interactivity attributes.

Still, DWAS has scope for improvement in identifying and analyzing new attributes for content analysis and in applying new data mining algorithms for link analysis, as suggested in [1][7][8].

III. MODIFIED DARK WEB ATTRIBUTE SYSTEM FOR COUNTER TERRORISM

The aim of the proposed work is to identify terrorists' names and their Web sites and then download the Web site contents for analysis. The steps involved are identifying terrorist leader names, identifying terrorist group URLs, and expanding the terrorist URL set through link and forum analysis. There is also a need to cluster the related Web sites and to define a new set of Web interactivity attributes for calculating the Web content and understanding the level of advancement and effectiveness of terrorists' Web usage. This new set
of attributes is shown in Table No. 1. It consists of nine high-level attributes, each of which is composed of multiple fine-grained low-level attributes.

High-level attribute
  Low-level attribute      Description                                                                      Weight

Technical Sophistication attribute
  Menu                     Use of the menu tag in website design                                            2
  Meta                     Use of the meta tag in website design                                            2.5
  Style                    Use of the style tag in website design                                           1
  Label                    Use of the label tag in website design                                           2.5
  Span                     Use of the span tag in website design                                            3

Fundamental attribute
  Form                     Use of the form tag in website design                                            1.5
  Frame                    Use of the frame tag in website design                                           2
  Table                    Use of the table tag in website design                                           2

Advanced technical sophistication attribute
  Java script              Use of java script language                                                      4
  Script                   Use of script language                                                           4

Dynamic web programming attribute
  Java                     Use of java language                                                             2.5
  PHP                      Scripting language designed for Web development to produce dynamic Web pages     5
  ASP                      Use of Active Server Pages (ASP)                                                 5.5

Content Richness
  Flash                    Banner depicting representative figure, graphical symbol or seal                 1
  Image                    Banner depicting representative figure, graphical symbol or seal                 1
  Audio                    Short phrase with religious or ideological connotation                           1
  Video                    Video on religion, attack etc.                                                   1

Web interactivity attribute: Communication (User generated content)
  List                     List of leader name, address etc.                                                2.3
  Contact                  Telephone number                                                                 1.2
  Email                    Email address                                                                    2.5
  Comment                  Allow the user to give feedback or ask question to the site owner or maintainer  2.4

Web interactivity attribute: Online Organizational attribute
  Videoconference          Video clip of bombings, game, animated picture, etc.                             3.3
  Online recruitment       Invitation to join or attend meeting, interview etc.                             4.5
  E-tendering attributes   Invitation & publish the E-tendering information                                 4.5

Table No. 1 Attributes used in the content analysis

The proposed Dark Web Attribute System architecture is shown in Figure No. 1. The proposed Dark Web Attribute System will have the following modules.

Module No.1 Dark web collection

Identify the terrorist names and URLs of
terrorist groups from the dark web. Then, using a link-analysis program, automatically extract the out-links and in-links of those URLs. A robust filtering method will be applied to identify the essential terrorist websites. Then, using an automatic web crawling toolkit called spidersRUs, one can download the entire web document within these sites.

Figure No.1. Proposed Dark Web Attribute System Architecture

Module No.2 Content analysis of terrorist web sites

After downloading the entire web document within these sites, Apache POI (well known in the Java field) is used to read the entire web document and write it out in Excel sheet format. Twenty-seven types of attributes are selected for analysis, as shown in Table No. 1. These attributes are assigned weight values, and the weight of each attribute is then calculated. Finally, a benchmark comparison between terrorist websites and genuine websites is produced. Data from all websites belonging to a cluster is aggregated, and the normalized content level is calculated on six dimensions. Each dimension represents a normalized activity scale between 0 and 1, showing the degree of activity on the dimension. The activity scale of cluster c on dimension d is calculated by the following formula, where n is the total number of attributes in dimension d and m is the total number of web sites belonging to cluster c.

[Formula for the activity scale of cluster c on dimension d: garbled in the source; only a summation sign survives.]

Module No.3 Link analysis of terrorist web sites

To find the relationships among different web sites of the same group and the interactions with other extremist groups, the first step is to calculate the similarity between every pair of web sites in the collection. Similarity is defined as a real-valued multivariable function of the number of hyperlinks in web site A pointing to web site B and the number of hyperlinks in site B pointing to site A. A hyperlink appearing on a site's homepage has a higher weight than a hyperlink appearing at a deeper level. The similarity between web sites A and B is calculated using equation (1).

\[ \mathrm{Similarity}(A, B) = \sum_{\text{all links } L \text{ between } A \text{ and } B} \frac{1}{1 + lv(L)} \qquad (1) \]

where lv(L) is the level of link L in the web site's hierarchy, with the homepage as level 0 and each lower level in the hierarchy increased by one.
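The link-based similarity of equation (1) can be illustrated with a short sketch. It assumes that each hyperlink contributes 1 / (1 + lv(L)), so that homepage links (level 0) count fully and deeper links count less; that weighting is an assumption consistent with the description of lv(L), not a confirmed transcription of the original equation. The `Link` tuple layout and the example host names are hypothetical.

```python
from typing import Iterable, Tuple

# A hyperlink is modelled as (source_site, target_site, level), where level is
# lv(L): 0 for a link on the homepage, 1 for one level below it, and so on.
# This tuple layout is an illustrative assumption, not part of the paper.
Link = Tuple[str, str, int]


def similarity(site_a: str, site_b: str, links: Iterable[Link]) -> float:
    """Sum 1 / (1 + lv(L)) over all links between site_a and site_b.

    Links in both directions (A -> B and B -> A) are counted.
    The 1 / (1 + lv(L)) weighting is an assumed form of equation (1).
    """
    total = 0.0
    for source, target, level in links:
        if {source, target} == {site_a, site_b}:
            total += 1.0 / (1.0 + level)
    return total


# Hypothetical example data: two links between a.example and b.example.
example_links = [
    ("a.example", "b.example", 0),  # homepage link, weight 1
    ("b.example", "a.example", 1),  # one level deep, weight 1/2
    ("a.example", "c.example", 0),  # unrelated to the pair (a, b)
]

print(similarity("a.example", "b.example", example_links))  # 1.5
```

Because the sum runs over links in both directions, the measure is symmetric, which suits an undirected nodes-and-edges view of the collection.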
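The content analysis of Module No. 2 can be sketched as follows. The tag weights are copied from Table No. 1; everything else is an assumption made for illustration: the paper reads pages with Apache POI in Java, while this sketch uses Python's standard `html.parser`; the weighted value is taken to be the raw count times the weight; and the activity scale is taken to be the fraction of (attribute, site) pairs in which the attribute occurs at least once, which yields a value between 0 and 1 as described but is not a confirmed form of the original formula.

```python
from collections import Counter
from html.parser import HTMLParser

# Weights for the tag-based attributes, copied from Table No. 1.
TAG_WEIGHTS = {
    "menu": 2.0, "meta": 2.5, "style": 1.0, "label": 2.5,
    "span": 3.0, "form": 1.5, "frame": 2.0, "table": 2.0,
}


class TagCounter(HTMLParser):
    """Counts opening tags that correspond to weighted attributes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.counts[tag] += 1


def count_attributes(html: str) -> Counter:
    """Return how often each weighted tag occurs in one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def weighted_counts(counts: Counter) -> dict:
    """Assumption: weighted value = raw count * weight from Table No. 1."""
    return {tag: counts[tag] * weight for tag, weight in TAG_WEIGHTS.items()}


def activity_scale(site_counts, dimension) -> float:
    """Assumed activity scale of a cluster on one dimension.

    site_counts: one Counter per web site in the cluster (m sites).
    dimension:   names of the attributes in the dimension (n attributes).
    Returns the share of (attribute, site) pairs where the attribute occurs.
    """
    n, m = len(dimension), len(site_counts)
    if n == 0 or m == 0:
        return 0.0
    present = sum(
        1 for counts in site_counts for attr in dimension if counts[attr] > 0
    )
    return present / (m * n)


# Hypothetical page used only to exercise the functions.
page = (
    "<html><head><meta charset='utf-8'><style>p{}</style></head>"
    "<body><form><label>x</label><span>a</span><span>b</span></form>"
    "</body></html>"
)
page_counts = count_attributes(page)
cluster = [page_counts, Counter({"meta": 1})]
print(weighted_counts(page_counts)["span"])      # 6.0
print(activity_scale(cluster, ["meta", "span"]))  # 0.75
```

Keeping the raw counts separate from the weighting step makes it straightforward to swap in a different weighting rule if the assumed one does not match the intended calculation.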
Module No. 4 Clustering of terrorist web sites

The Business System Planning (BSP) clustering algorithm is used to form the clusters of terrorist web sites. Based on the initial results, the seed URLs were crawled to find clusters within the dark web. The crawling was done for two more iterations, so as to get one level of direct association and one level of indirect association between the websites. The linking is shown as a graph of nodes and edges, which shows the names of the nodes connected by each edge. The first iteration helped in identifying the associations between the dark web URLs, whereas the second iteration helped differentiate the clusters based on two-step associations.

Module No. 5 Identification & blocking of terrorist websites

The approach involves applying robust filtering methods to find the essential websites and remove the unwanted ones, which is useful for further analysis. Extending and implementing the web crawler to download the entire web document within these sites enables the parameter gathering. Different statistical calculations were carried out on the data in order to finalize the benchmark.

The average result (mean) X is calculated by summing the individual results and dividing this sum by the number (n) of individual values:

\[ X = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The standard deviation is a measure of how precise the average is, that is, how well the individual numbers agree with each other. It is a measure of a type of error called random error, the kind of error people cannot control very well. It is calculated as follows:

Standard deviation = [formula missing in the source]

The relative standard deviation (RSD) is often more convenient. It is expressed in percent and is obtained by multiplying the standard deviation by 100 and dividing this product by the average:

\[ \mathrm{RSD} = \frac{100 \times \text{standard deviation}}{X} \]

A confidence interval is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level or confidence coefficient.

IV. EXPERIMENTAL RESULT

Module No.1 Dark web collection

The names and URLs of terrorist groups are identified from government reports, such as those of the FBI and the US State Department, and from research centers such as MEMRI and ATC. A web crawler is then used to automatically extract the out-links and in-links of those URLs.

Module No.2 Content analysis of terrorist web sites

The content was downloaded for the shortlisted terrorist URLs. Twenty-seven types of attributes were selected for analysis. Five major attribute groups were formed: Technical Sophistication (grouping the menu, meta, label, style, span, form, frame and table attributes), Advanced Technical Sophistication (grouping the java script and script attributes), Dynamic Web Programming (grouping the Java, PHP and ASP attributes), Content Richness (grouping the Flash, image, audio, video, number of hyperlinks and number of downloadable documents attributes), and Web Interactivity (grouping the list, contact, email, comment, videoconference, online recruitment and E-tendering attributes). Graphs were plotted to study their relationship further. Figure No. 2 shows the
attributes gathered and their calculations. The entire information is combined to form the total attribute count of each URL. A consolidated sheet has been created which represents data for the entire group.

Figure No 2 Snapshot of Attribute count for a terrorist website.

For better analysis, the attribute counts of terrorist websites are compared with those of genuine websites (government websites of the US, India, Australia etc. and standard organizations such as Infosys, TCS, Reliance etc.). Graph No. 1 shows a clear difference between the averages of the two groups. The mean count for the genuine websites group is higher for every attribute. The terrorist/extremist web usage is more concentrated on content richness. Other attributes, like web interactivity, receive less focus, which helps them in one-way communication. Average usage of Basic/Advanced Technical Sophistication and Dynamic Web Programming

Graph No. 1 Comparison between averages of counts for all the attributes, for both the groups.

Module No.3 Link analysis of terrorist web sites

After identifying the seed URLs, we extracted the out-links and in-links of the seed URLs using automatic link-analysis programs. The expanded extremist URL set was then manually filtered by domain experts to ensure that irrelevant and bogus Web sites did not make their way into our collection. After the filtering, a total of 1116 extremist group URLs were included in our final URL set. Figure No. 3 shows the link tree of terrorist websites.

Figure No 3. Link Tree for terrorist websites

Module No. 4 Clustering of terrorist web sites

Based on the initial results, the seed URLs were crawled to find clusters within the dark web. The crawling was done for two more iterations, so as to get one level of direct association and one level of indirect association between the websites. The linking is shown as a graph of nodes and edges, which shows the names of the nodes connected by each edge. The first iteration helped in identifying the associations between the dark web URLs, whereas the second iteration helped differentiate the clusters based on two-step associations. Figure No. 4 shows the cluster of terrorist websites created by using the Business System Planning (BSP) clustering algorithm.
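The two crawl iterations described above amount to a breadth-first expansion that stops two link hops from the seed. The sketch below illustrates only that expansion step on an in-memory link map; it is not the BSP clustering algorithm itself, and the function name, the `out_links` dictionary and the host names are hypothetical stand-ins for real crawl output.

```python
from collections import deque


def associations(seed, out_links, max_depth=2):
    """Return {site: depth} for sites reachable from seed within max_depth hops.

    Depth 1 corresponds to a direct association, depth 2 to an indirect
    (two-step) association. out_links maps each site to the sites it links to.
    """
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        site = queue.popleft()
        if depth[site] == max_depth:
            continue  # do not expand beyond the second iteration
        for neighbour in out_links.get(site, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[site] + 1
                queue.append(neighbour)
    del depth[seed]
    return depth


# Hypothetical link map standing in for crawled out-links.
out_links = {
    "seed.example": ["a.example", "b.example"],
    "a.example": ["c.example", "seed.example"],
    "c.example": ["d.example"],
}

found = associations("seed.example", out_links)
direct = sorted(site for site, d in found.items() if d == 1)
indirect = sorted(site for site, d in found.items() if d == 2)

# Edge list for a nodes-and-edges graph, restricted to the sites found.
nodes = set(found) | {"seed.example"}
edges = sorted(
    (src, dst)
    for src, targets in out_links.items() if src in nodes
    for dst in targets if dst in nodes
)

print(direct)    # ['a.example', 'b.example']
print(indirect)  # ['c.example']
```

The resulting edge list can be handed to any clustering or graph-drawing step; sites more than two hops away (here `d.example`) are left out, matching the two-iteration limit.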
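The benchmark statistics described under Module No. 5 (mean, standard deviation, relative standard deviation and confidence interval) and the resulting range check can be sketched as below. Several choices are assumptions, since the source formulas are not recoverable: the sample standard deviation (n - 1 denominator) is used, the confidence interval uses a normal approximation with z = 1.96 for a 95% level, and the benchmark range of each attribute group is taken to be that interval. The function names and sample numbers are hypothetical; the 0.7 threshold follows the 70% match figure stated in the paper.

```python
import math
import statistics


def summarize(values, z=1.96):
    """Mean, sample SD, RSD (percent) and an approximate confidence interval.

    Assumptions: sample SD with n - 1 denominator; normal-approximation
    interval mean +/- z * SD / sqrt(n), with z = 1.96 for roughly 95%.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    rsd = 100.0 * sd / mean
    half_width = z * sd / math.sqrt(n)
    return {
        "mean": mean,
        "sd": sd,
        "rsd": rsd,
        "ci": (mean - half_width, mean + half_width),
    }


def matches_benchmark(site_values, benchmark, min_fraction=0.7):
    """True if enough attribute values fall inside their benchmark range.

    benchmark maps attribute group -> (low, high); missing values never match.
    """
    in_range = sum(
        1
        for name, (low, high) in benchmark.items()
        if low <= site_values.get(name, float("nan")) <= high
    )
    return in_range / len(benchmark) >= min_fraction


# Hypothetical attribute counts for one attribute group across three sites.
summary = summarize([10, 12, 14])
print(summary["mean"], summary["sd"])  # 12 2.0

# Hypothetical benchmark with seven attribute groups sharing one range.
benchmark = {f"group{i}": (0.0, 10.0) for i in range(7)}
site_ok = {f"group{i}": 5.0 for i in range(6)}
site_ok["group6"] = 50.0  # one value outside its range, six inside
print(matches_benchmark(site_ok, benchmark))  # True
```

Expressing the threshold as a fraction keeps the check usable whether seven or nine attribute groups are compared.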
Figure No. 4 Cluster of terrorist websites

Module No. 5 Identification & blocking of terrorist websites

Graph No. 2 shows the benchmark of terrorist websites. A benchmark has been devised from the data and its mean, standard deviation and confidence interval. A new website can be checked against this benchmark to decide its primary inclusion in the terrorist/extremist list, for further analysis such as clustering and blocking.

Graph No. 2 Benchmark of terrorist websites

To block websites on the basis of the benchmark evaluation, a website blocker was written to crawl and block suspicious web pages. It works in this manner:

1. Crawl the entered website URLs.
2. Download their content for analysis.
3. Analyze the content and calculate the various attributes.
4. Group the attributes in 9 groups as stated earlier.
5. Check every group of attributes against the respective benchmark.
6. Check if 70% match is found (6 out of 7 attribute values fall in the range).
7. Block those URLs out of the entered URLs which fail the benchmark test.

V. CONCLUSION

In this work we identified terrorist websites from anti-terrorism research centres, the FBI, government sources etc., and then downloaded the content of these websites. These contents are compared with government (legitimate) sites. The benchmark prepared accordingly can be further used for identification of other sites of terrorist/extremist groups, and by using the website blocker all the identified terrorist websites will be blocked. Also, by using the Business System Planning clustering algorithm we can create clusters of these terrorist websites, which is valuable for visualization and analysis of hidden domestic terrorist communities and intercommunity relationships among all web sites. This project is useful in understanding recent changes of pattern in terrorists'/extremists' use of the web.

Future work is focused on exploring more advanced machine learning techniques to detect technology and media usage patterns in terrorist web sites and to gain more insight into terrorist usage. The greater technical sophistication achieved by these groups underlines the need for more study in the areas of pattern recognition, content determination and web interaction by terrorists. The work can be extended to offer a continuously adapting tool which will track web usage on a periodic basis and revise the benchmark accordingly in order to determine the threat from terrorist websites.

REFERENCES
[1] Jialun Qin, Yilu Zhou, Edna Reid, Guanpi Lai, Hsinchun Chen, "Analyzing terror campaigns on the internet: Technical sophistication, content richness, and Web interactivity", International Journal of Human-Computer Studies, Nov 2006.
[2] Hsinchun Chen, Sven Thoms, T. J. Fu, "Cyber Extremism in Web 2.0: An Exploratory Study of International Jihadist Groups", IEEE International Conference on Intelligence and Security Informatics, 2008.
[3] Michael Chau, Jennifer Xu, "Mining communities and their relationships in blogs: A study of online hate groups", International Journal of Human-Computer Studies, Oct 2006.
[4] Peter A. Gloor, Jonas Krauss, Stefan Nann, Kai Fischbach, Detlef Schoder, "Web Science 2.0: Identifying Trends through Semantic Social Network Analysis", International Conference on Computational Science and Engineering, 2009.
[5] Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., "The Dark Web Portal: Collecting and Analyzing the Presence of Domestic and International Terrorist Groups on the Web", Proceedings of the International IEEE Conference on Intelligent Transportation Systems, 2004.
[6] Shohreh Ajoudanian and Mohammad Davarpanah Jazi, "Deep Web Content Mining", World Academy of Science, Engineering and Technology 49, 2009.
[7] Jialun Qin, Yilu Zhou, Edna Reid, Guanpi Lai, Hsinchun Chen, "US Domestic Extremist Groups on the Web: Link and Content Analysis", IEEE Intelligent Systems, September/October 2005.
[8] Sanjeev Sharma and R. K. Gupta, "Improved BSP Clustering Algorithm for Social Network Analysis", International Journal of Grid and Distributed Computing, Vol. 3, September 2010.