International Journal of Engineering Trends and Technology - Volume 4 Issue 3 - 2013
ISSN: 2231-5381 http://www.internationaljournalssrg.org Page 463

A Novel Approach for Recognized & Overcrowding of Terrorist Websites

Prof. G. A. Patil
Computer science & engineering department, D. Y. Patil College of Engineering & Technology, Kolhapur, Maharashtra, India

Prof. K. B. Manwade
Computer science & engineering department, Ashokrao Mane Group of Institution, Vathar tarf Vadgaon, Maharashtra, India

Mr. P. S. Landge
Computer science & engineering department, D. Y. Patil College of Engineering & Technology, Kolhapur, Maharashtra, India

Abstract- Extremists mainly use the Internet to enhance their information operations surrounding propaganda, communication, and psychological warfare. Islamic militant organizations, such as Al Qaeda, Hamas, Hezbollah, etc., have been intensively utilizing the Internet to disseminate their anti-Western, anti-Israel propaganda, provide training materials to their supporters, plan their operations, and raise funds by selling goods through their Web sites. Our experiential understanding of their Internet usage remains limited. To address this research gap, we explore an integrated approach for identifying and collecting terrorist/extremist Web contents and for discovering hidden relationships among communities. It has been shown in the literature that content analysis gives more insight into technical sophistication and content richness, whereas link analysis focuses on Web interactivity. We introduced a quantitative Dark Web content analysis tool called the Dark Web Attribute System (DWAS) and tested it by applying it in the study of the extremist organizations' Internet usage. The present work focuses on identifying and analyzing new web page attributes. It aims to compare different terrorist/extremist sites with genuine sites and accordingly prepare metrics which can be further used for identification of other sites of terrorist/extremist groups. It further attempts to visualize and analyze hidden domestic terrorism communities and intercommunity relationships among all web sites in our collection.

Keywords: Web content analysis, Web usage analysis, Web collection building, Social Network.

I. INTRODUCTION

Studying the sophistication of global extremist organizations' Web presence would allow us to better understand extremist organizations' technical sophistication, their access to information technology related resources, and their propaganda plans. However, such study is hampered by the covert nature of the Dark Web and the lack of efficient automatic methodologies to monitor and analyze large amounts of Web content.

Terrorist organizations have generated thousands of Web sites that support psychological warfare, fundraising, recruitment, coordination, and distribution of propaganda materials. The level of technical sophistication of the Islamic terrorist organizations' Web sites has increased according to Katz, who monitors Islamic fundamentalist Internet activities. The rapid proliferation and increased sophistication of Web sites and online forums run by terrorist/extremist organizations are indications of the growing popularity of the Internet in terrorism campaigns. They also indicate that there is a vast pool of sympathizers that such organizations have attracted, with some applying their IT expertise as contributions to the cause.

The Web has evolved towards multimedia-rich content delivery, end user personal content generation, and community-based social interactions. Due to the freedom and convenience of publishing in Weblogs [2] [3], this form of media provides an ideal environment as a propaganda platform for extremist or terrorist groups to promote their ideologies. Criminals may also make use of the virtual environment to organize crimes such as money laundering and drug trafficking without being easily identified. As a result, it is important to understand the social network of the bloggers in order to assess the risks that may threaten national security.

II. LITERATURE SURVEY

In social network analysis the main task is usually how to extract social networks from different communication resources [4]. The data used for building social networks is relational data, which can be obtained and transferred from different resources including the web, email communication, Internet relay chats, telephone communications, organization and business events, etc. In recent years, there have been studies of how terrorists use the Web to facilitate their activities. The first step towards studying terrorists' tactical use of the Web is to build a high-quality Dark Web [5] collection.

The rapid expansion of the web is causing constant growth of information, leading to several problems such as increased difficulty of extracting potentially useful knowledge. Web content mining [6] confronts this problem by gathering explicit information from different web sites for access and knowledge discovery. Web mining is concerned with the use of data mining techniques to automatically discover and extract information from World Wide Web documents and services. Web content mining is an approach to extracting information from web based databases.

The DWAS is an effective tool to analyze the technical sophistication of terrorist/extremist groups' Internet usage and could contribute to an evidence based understanding of the applications of Web technologies in the global terrorism phenomena. DWAS is used to visualize and analyze hidden domestic terrorism community and intercommunity relationships [7]. The DWAS helps in identifying the groups that are considered by authoritative sources as terrorist/extremist groups. The sources include government agency reports, authoritative organization reports and studies published by terrorism research centers. DWAS also identifies a set of seed terrorist group URLs from the authoritative sources and uses the terrorism keyword lexicon to query major search engines on the Web. After identifying the seed URLs, out-links and in-links of the seed URLs were automatically extracted using link-analysis programs. Once the terrorist/extremist Web sites are identified, a program is used to automatically download all their contents. The DWAS framework focuses on the attributes that could help us better understand the level of advancement and effectiveness of terrorists' Web usage, namely, technical sophistication attributes, content richness attributes (an extension of the traditional media richness attributes), and Web interactivity attributes.

DWAS still has scope for improvement in identifying and analyzing new attributes for content analysis and in applying new data mining algorithms for link analysis, as suggested in [1][7][8].

III. MODIFIED DARK WEB ATTRIBUTE SYSTEM FOR COUNTER TERRORISM

The aim of the proposed work is to identify terrorists' names and their web sites and then download the web site contents for analysis. The steps involved are identifying terrorist leader names, identifying terrorist group URLs, and expanding the terrorist URL set through link and forum analysis. There is also a need to cluster the related websites and to define a new set of Web interactivity attributes for calculating the web content and for understanding the level of advancement and effectiveness of terrorists' Web usage. This new set of attributes is shown in Table No. 1.
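The URL expansion step described above (starting from seed URLs and following their out-links) can be illustrated with a minimal Python sketch. It is not the link-analysis program used in this work: the names `LinkCollector`, `out_links` and `expand_url_set` are hypothetical, the pages are assumed to be already downloaded as HTML text, and in-links are not covered because they cannot be found by parsing a site's own pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute http(s) targets of <a href=...> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.add(absolute)


def out_links(base_url, html):
    """Return the links in html that point to a different host, sorted."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    own_host = urlparse(base_url).netloc
    return sorted(u for u in collector.links if urlparse(u).netloc != own_host)


def expand_url_set(seed_pages):
    """seed_pages maps seed URL -> HTML text; returns the seeds plus their out-links."""
    expanded = set(seed_pages)
    for url, html in seed_pages.items():
        expanded.update(out_links(url, html))
    return expanded
```

In practice the expanded set would still need the manual or automatic filtering step before any site is added to the collection.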
The attribute set of Table No. 1 consists of nine high-level attributes, each of which is composed of multiple fine-grained low-level attributes. The architecture of the proposed Dark Web Attribute System is shown in Figure No. 1; its modules are described after the table.

| High-level attribute | Low-level attribute | Description | Weight |
|---|---|---|---|
| Technical sophistication attribute (fundamental attribute) | Menu | Use of the menu tag in the site design | 2 |
| | Meta | Use of the meta tag in the site design | 2.5 |
| | Style | Use of the style tag in the site design | 1 |
| | Label | Use of the label tag in the site design | 2.5 |
| | Span | Use of the span tag in the site design | 3 |
| | Form | Use of the form tag in the site design | 1.5 |
| | Frame | Use of the frame tag in the site design | 2 |
| | Table | Use of the table tag in the site design | 2 |
| Advanced technical sophistication attribute | Java script | Use of java script language | 4 |
| | Script | Use of script language | 4 |
| Dynamic web programming attribute | Java | Use of java language | 2.5 |
| | PHP | Scripting language designed for Web development to produce dynamic Web pages | 5 |
| | ASP | Use of Active Server Pages (ASP) | 5.5 |
| Content richness attribute | Flash | Banner depicting representative figure, graphical symbol or seal | 1 |
| | Image | Banner depicting representative figure, graphical symbol or seal | 1 |
| | Audio | Short phrase with religious or ideological connotation | 1 |
| | Video | Video on religion, attack etc. | 1 |
| Web interactivity attribute: communication (user generated content) | List | List of leader name, address etc. | 2.3 |
| | Contact | Telephone number | 1.2 |
| | Email | Email address | 2.5 |
| | Comment | Allows the user to give feedback or ask questions to the site owner or maintainer | 2.4 |
| Web interactivity attribute: online organizational attributes | Videoconference | Video clip of bombings, game, animated picture, etc. | 3.3 |
| | Online recruitment | Invitation to join or attend meeting, interview etc. | 4.5 |
| | E-tendering | Invitation to and publication of E-tendering information | 4.5 |
Table No. 1 Attributes used in the content analysis

Module No. 1 Dark web collection

Identify the terrorist names and the URLs of terrorist groups from the dark web. Then, using a link-analysis program, automatically extract the out-links and in-links of each URL. A robust filtering method is applied to identify the essential terrorist websites. Then, using an automatic web crawling toolkit called spidersRUs, the entire web documents within these sites can be downloaded.

Figure No. 1 Proposed Dark Web Attribute System Architecture

Module No. 2 Content analysis of terrorist web sites

After downloading the entire web documents within these sites, Apache POI (well known in the Java field) is used to read the web documents and write them out in Excel sheet format. Twenty seven types of attributes are selected for analysis, as shown in Table No. 1. These attributes are assigned weight values, and the weight of each attribute is then calculated. Finally, a benchmark comparison between terrorist websites and genuine websites is produced. The data from all websites belonging to a cluster is aggregated, and the normalized content level is calculated in six dimensions. Each dimension represents a normalized activity scale between 0 and 1, showing the degree of activity on the dimension. The activity scale of cluster c on dimension d is calculated by the following formula, where n is the total number of attributes in dimension d, m is the total number of web sites belonging to cluster c, and $w_{i,j}$ is 1 if attribute i occurs in web site j and 0 otherwise:

$$\mathrm{Activity}(c, d) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} w_{i,j}}{m \times n}$$

Module No. 3 Link analysis of terrorist web sites

To find relationships among different web sites of the same group, and the interaction with other extremist groups, the first step is to calculate the similarity between all web site pairs in the collection. Similarity is defined as a real-valued multivariable function of the number of hyperlinks in web site A pointing to web site B and the number of hyperlinks in site B pointing to site A. A hyperlink appearing on a site's homepage has a higher weight than a hyperlink appearing at a deeper level. The similarity between web sites A and B is calculated using equation (1):

$$\mathrm{Similarity}(A, B) = \sum_{\text{all links } L \text{ between } A \text{ and } B} \frac{1}{1 + lv(L)} \qquad (1)$$

where $lv(L)$ is the level of link L in the web site hierarchy, with the homepage as level 0 and each lower level in the hierarchy increased by one.

Module No. 4 Clustering of terrorist web sites

The Business System Planning (BSP) clustering algorithm is used to form the clusters of terrorist web sites. Based on the initial results, the seed URLs were crawled to find clusters within the dark web. The crawling was done for two more iterations, so as to get one level of direct association and one level of indirect association between the websites. The linking is shown as a graph of nodes and edges, which shows the names of the nodes connected by each edge. The first iteration helped in identifying the associations between the dark web URLs, whereas the second iteration helped differentiate the clusters based on two-step associations.

Module No. 5 Identification & blocking of terrorist websites

The approach involves applying robust filtering methods to find the essential websites and remove the unwanted websites, which is useful for further analysis. Extending and implementing the web crawler to download the entire web documents within these sites enables the parameter gathering. Different statistical calculations were carried out on the data in order to finalize the benchmark.

The average result (mean) $\bar{x}$ is calculated by summing the individual results and dividing this sum by the number (n) of individual values:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The standard deviation is a measure of how precise the average is, that is, how well the individual numbers agree with each other. It is a measure of a type of error called random error, the kind of error people cannot control very well. It is calculated as follows:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

The relative standard deviation (RSD) is often more convenient. It is expressed in percent and is obtained by multiplying the standard deviation by 100 and dividing this product by the average:

$$\mathrm{RSD} = \frac{100 \, s}{\bar{x}}$$

A confidence interval is a particular kind of interval estimate of a population parameter and is used to indicate the reliability of an estimate. It is an observed interval (i.e. it is calculated from the observations), in principle different from sample to sample, that frequently includes the parameter of interest if the experiment is repeated. How frequently the observed interval contains the parameter is determined by the confidence level or confidence coefficient.

IV. EXPERIMENTAL RESULT

Module No. 1 Dark web collection

The names and URLs of terrorist groups are identified from government reports (such as those of the FBI and the US State Department) and from research centers (MEMRI, ATC, etc.). A web crawler is then used to automatically extract the out-links and in-links of each URL.

Module No. 2 Content analysis of terrorist web sites

The content was downloaded for the shortlisted terrorist URLs. Twenty seven types of attributes were selected for analysis. Five major attribute groups were formed: Technical Sophistication (grouping the menu, meta, label, style, span, form, frame and table attributes), Advanced Technical Sophistication (grouping the java script and script attributes), Dynamic Web Programming (grouping the Java, PHP and ASP attributes), Content Richness (grouping the Flash, image, audio, video, number of hyperlinks and number of download documents attributes), and Web Interactivity (grouping the list, contact, email, comment, videoconference, online recruitment and E-tendering attributes).
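As a rough illustration, the five attribute groups and the weights of Table No. 1 might be combined into per-site group scores, and the benchmark statistics of Module No. 5 computed over those scores, as sketched below. The paper does not state the scoring rule, so the weighted sum of attribute counts used here is an assumption. The 95% normal-approximation confidence interval (`z = 1.96`) is likewise an assumption, and the names `WEIGHTS`, `GROUPS`, `group_scores` and `benchmark` are hypothetical. The hyperlink and download-document counts are left out because Table No. 1 lists no weight for them.

```python
import math
from statistics import mean, stdev

# Weights as listed in Table No. 1 (keys are lower-cased attribute names).
WEIGHTS = {
    "menu": 2, "meta": 2.5, "style": 1, "label": 2.5,
    "span": 3, "form": 1.5, "frame": 2, "table": 2,
    "javascript": 4, "script": 4,
    "java": 2.5, "php": 5, "asp": 5.5,
    "flash": 1, "image": 1, "audio": 1, "video": 1,
    "list": 2.3, "contact": 1.2, "email": 2.5, "comment": 2.4,
    "videoconference": 3.3, "online_recruitment": 4.5, "e_tendering": 4.5,
}

# The five attribute groups named in the content analysis results.
GROUPS = {
    "technical_sophistication": [
        "menu", "meta", "label", "style", "span", "form", "frame", "table"],
    "advanced_technical_sophistication": ["javascript", "script"],
    "dynamic_web_programming": ["java", "php", "asp"],
    "content_richness": ["flash", "image", "audio", "video"],
    "web_interactivity": [
        "list", "contact", "email", "comment",
        "videoconference", "online_recruitment", "e_tendering"],
}


def group_scores(counts):
    """Weighted sum of attribute counts per group for one site (assumed rule)."""
    return {
        group: sum(WEIGHTS[attr] * counts.get(attr, 0) for attr in attrs)
        for group, attrs in GROUPS.items()
    }


def benchmark(values, z=1.96):
    """Mean, sample standard deviation, RSD in percent and an approximate CI.

    Needs at least two values; stdev() divides by n - 1.
    """
    n = len(values)
    avg = mean(values)
    sd = stdev(values)
    rsd = 100 * sd / avg if avg else float("nan")
    half_width = z * sd / math.sqrt(n)
    return {"mean": avg, "sd": sd, "rsd": rsd,
            "ci": (avg - half_width, avg + half_width)}
```

A benchmark for one group would then be `benchmark([group_scores(c)["content_richness"] for c in site_counts])`, where `site_counts` is a hypothetical list of per-site attribute count dictionaries.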
Graphs were plotted to study the relationships among these groups further. Figure No. 2 shows the attributes gathered and their calculations. The entire information is combined to form the total attribute count of each URL. A consolidated sheet has been created which represents the data for the entire group.

Figure No. 2 Snapshot of attribute count for a terrorist website

For better analysis, the attribute counts of the terrorist websites are compared with those of genuine websites (government websites of the US, India, Australia, etc., and standard organizations such as Infosys, TCS, Reliance, etc.). Graph No. 1 shows a clear difference between the averages of the two groups. The mean count for the genuine websites group is higher for every attribute. The terrorist/extremist web usage is more concentrated on content richness. Other attributes, like web interactivity, are less focused, which helps them in one-way communication.

Average usage of Basic/Advanced Technical Sophistication and Dynamic Web Programming

Module No. 3 Link analysis of terrorist web sites

After identifying the seed URLs, we extracted the out-links and in-links of the seed URLs using automatic link-analysis programs. The expanded extremist URL set was then manually filtered by domain experts to ensure that irrelevant and bogus Web sites did not make their way into our collection. After the filtering, a total of 1116 extremist group URLs were included in our final URL set. Figure No. 3 shows the link tree of terrorist websites.

Figure No. 3 Link tree for terrorist websites

Module No. 4 Clustering of terrorist web sites

Based on the initial results, the seed URLs were crawled to find clusters within the dark web. The crawling was done for two more iterations, so as to get one level of direct association and one level of indirect association between the websites.
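The link bookkeeping behind the link analysis and the two crawl iterations can be sketched as follows. This is plain breadth-first search plus a link-level similarity sum, not the BSP clustering algorithm. It assumes the crawler has already produced a list of `(source_site, target_site, level)` records, where `level` is the depth of the page on which the link appears (homepage = 0), and it assumes the similarity of equation (1) has the form of a sum of 1 / (1 + level) over all links between two sites. The function names are hypothetical.

```python
from collections import deque


def similarity(links, site_a, site_b):
    """Sum 1 / (1 + level) over every link between the two sites, either direction."""
    pair = {site_a, site_b}
    return sum(
        1.0 / (1 + level)
        for source, target, level in links
        if source != target and {source, target} == pair
    )


def link_graph(links):
    """Build an adjacency map {source_site: set of target sites}."""
    graph = {}
    for source, target, _level in links:
        graph.setdefault(source, set()).add(target)
    return graph


def associations(graph, seed, max_depth=2):
    """Breadth-first search from seed, limited to max_depth hops.

    Returns {site: depth}; depth 1 is a direct association,
    depth 2 an indirect (two-step) association.
    """
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        site = queue.popleft()
        if depth[site] == max_depth:
            continue  # do not expand beyond the second iteration
        for neighbour in graph.get(site, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[site] + 1
                queue.append(neighbour)
    return depth
```

Sites at depth 1 correspond to the first crawl iteration and sites at depth 2 to the second; a clustering algorithm such as BSP would then operate on the pairwise similarity values.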
The linking is shown as a graph of nodes and edges, which shows the names of the nodes connected by each edge. The first iteration helped in identifying the associations between the dark web URLs, whereas the second iteration helped differentiate the clusters based on two-step associations. Figure No. 4 shows the cluster of terrorist websites created by using the business system planning (BSP) clustering algorithm.

Figure No. 4 Cluster of terrorist websites

Graph No. 1 Comparison between averages of counts for all the attributes, for both the groups

Module No. 5 Identification & blocking of terrorist websites

Graph No. 2 shows the benchmark of terrorist websites. A benchmark has been devised from the data and its mean, standard deviation and confidence interval. A new website can be checked against this benchmark to decide its primary inclusion in the terrorist/extremist list, for further analysis such as clustering and blocking.

Graph No. 2 Benchmark of terrorist websites

To block websites on the basis of the benchmark evaluation, a website blocker was written to crawl and block suspicious web pages. It works in the following manner: (1) crawl the entered website URLs; (2) download their content for analysis; (3) analyze the content and calculate the various attributes; (4) group the attributes in 9 groups as stated earlier; (5) check every group of attributes against the respective benchmark; (6) check if a 70% match is found (6 out of 7 attribute values fall in the range); (7) block those URLs out of the entered URLs which fail the benchmark test.

V. CONCLUSION

In this work we identified the terrorist websites from anti-terrorism research centres, the FBI, government sources, etc., and then downloaded the content of these websites. These contents were compared with government (legitimate) sites. The benchmark prepared accordingly can be further used for identification of other sites of terrorist/extremist groups, and by using the website blocker all the identified terrorist websites will be blocked. Also, by using the business system planning clustering algorithm we can create clusters of these terrorist websites, which is valuable for visualization and analysis of hidden domestic terrorist communities and intercommunity relationships among all web sites. This project is useful in understanding the recent change of pattern in terrorist/extremist use of the web.

Future work is focused on exploring more advanced machine learning techniques to detect technology and media usage patterns in terrorist web sites and to gain more insight into terrorist usage. The greater technical sophistication achieved by these groups underlines the need for more study in the areas of pattern recognition, content determination and web interaction phenomena by the terrorists. The work can be extended to offer a continuously adapting tool which will assess the web usage on a periodic basis and revise the benchmark accordingly in order to determine the threat from terrorist websites.

REFERENCES

[1] Jialun Qin, Yilu Zhou, Edna Reid, Guanpi Lai, Hsinchun Chen, "Analyzing terror campaigns on the internet: Technical sophistication, content richness, and Web interactivity", International Journal of Human-Computer Studies, Nov 2006.
[2] Hsinchun Chen, Sven Thoms, T. J. Fu, "Cyber Extremism in Web 2.0: An Exploratory Study of International Jihadist Groups", IEEE International Conference on Intelligence and Security Informatics, 2008.
[3] Michael Chau, Jennifer Xu, "Mining communities and their relationships in blogs: A study of online hate groups", International Journal of Human-Computer Studies, Oct 2006.
[4] Peter A. Gloor, Jonas Krauss, Stefan Nann, Kai Fischbach, Detlef Schoder, "Web Science 2.0: Identifying Trends through Semantic Social Network Analysis", International Conference on Computational Science & Engineering, 2009.
[5] Chen, H., Qin, J., Reid, E., Chung, W., Zhou, Y., Xi, W., Lai, G., "The Dark Web Portal: Collecting and Analyzing the Presence of Domestic and International Terrorist Groups on the Web", Proceedings of International IEEE Conference on Intelligent Transportation Systems, 2004.
[6] Shohreh Ajoudanian and Mohammad Davarpanah Jazi, "Deep Web Content Mining", World Academy of Science, Engineering and Technology 49, 2009.
[7] Jialun Qin, Yilu Zhou, Edna Reid, Guanpi Lai, Hsinchun Chen, "US Domestic Extremist Groups on the Web: Link and Content Analysis", IEEE Intelligent Systems, September/October 2005.
[8] Sanjeev Sharma and R. K. Gupta, "Improved BSP Clustering Algorithm for Social Network Analysis", International Journal of Grid and Distributed Computing, Vol. 3, September 2010.