COPLINK, Dark Web, and Hacker Web: A Research Path in Security Informatics

Dr. Hsinchun Chen
Artificial Intelligence Lab, University of Arizona
Acknowledgements: NSF, DOJ, DOD

SECURITY INFORMATICS

Leaderless Jihad and the Internet
• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”
• Before 2004: face-to-face interactions; 26-year-old
• After 2004: interactions on the Internet (Madrid, Dutch Hofstad, Cairo, Toronto…); Irhabi007 and Muntada; 20-year-old

Intelligence and Security Informatics
• Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)
• Data, text, and web mining
• From COPLINK to Dark Web

A Knowledge Discovery Framework for ISI

COPLINK

COPLINK Database and Schema
[Figure: two bar charts and a schema diagram with primary and foreign key markers (PK, FK1 to FK5)]
• Number of Documents (chart): categories Reports, Pawn Tickets, Warrants, Field Interviews; values shown 2,500,000; 150,000; 65,000; 45,000
• Number of Data Objects (chart): categories Person, Property, Vehicle, Organization, Weapon; values shown 1,300,000; 420,000; 400,000; 85,000; 39,000
• OBJECTS: OBJECTPK, OBJECTTYPE, OBJECTDESC
• PERSONS: PERSONPK, REALNAME, DOB, RACE, GENDER, MINDOB, MAXDOB, MINHEIGHT, MAXHEIGHT, MINWEIGHT, MAXWEIGHT, EYECOLOR, HAIRCOLOR, GANGFLAG, CAUTIONFLAG, WANTEDFLAG, PAWNERFLAG, FBIID, SID, LOCALID, FNGRPRTID, DNAID, PHOTOFILENAME, PHOTOIMAGE
• L_EYECOLTYPES: COLORTYPE, COLORCODE, COLORDESC, COLORRANK
• L_HAIRCOLTYPES: COLORTYPE, COLORCODE, COLORDESC, COLORRANK
• L_RACETYPES: RACETYPE, RACECODE, RACEDESC, RACERANK
• L_GENDERTYPES: GENDERTYPE, GENDERCODE, GENDERDESC, GENDERRANK

The COPLINK System: Crime Data Mining

COPLINK Identity Resolution and Criminal Network Analysis
Cross-jurisdictional Information Sharing/Collaboration
[Figure: data sources and tools. Law-enforcement data (AZ, CA, TX); border crossing data (AZ, CA, TX); Arizona IDMatcher; CAN Visualizer. Panels: identity-match tree, mutual information between Vehicle A and Vehicle B by date and time of day ("Frequent Crossers at Night"), narcotics network.]
• Identity Resolution: Detect false and deceptive identities across jurisdictions using a probabilistic naïve-Bayes-based resolution system.
• High-risk Vehicle Identification: Identify high-risk vehicles by applying association techniques such as mutual information to border crossing and law-enforcement data.
• Criminal Link Prediction: Predict interaction between individuals and vehicles using link prediction techniques to identify high-risk border crossers.
• Suspect Traffic Burst Detection: Detect real-time anomalies and threats in border traffic using Markov switching and other models.
* Only the grayed datasets are available to the AI Lab

A Four-layer Naïve-Bayes Model for Identity Resolution
[Figure: four-layer model]
• Layer 1: Identity Match
• Layer 2: Name Match, DOB Match, Address Match, ID Match
• Layer 3: First Name Match, Middle Name Match, Last Name Match
• Layer 4: First Name Similarity, Middle Name Similarity, Last Name Similarity, DOB Similarity, Address Similarity, ID Similarity
• A multi-layer structure is able to model complex attribute dependencies.

Evaluation Results: AZ IDMatcher vs. IBM IR (NORA, Jeff Jonas)

                              Gang subset                  Narcotics subset
                              AZ IDMatcher   IBM IR        AZ IDMatcher   IBM IR
Number of records             4,023                        31,978
Identities in gold standard   2,420                        16,977
Identities found by system    2,618          2,846         14,363         15,690
Compression ratio             34.92%         29.25%        55.08%         50.93%
Precision                     0.99           0.99          0.88           0.89
Recall                        0.95           0.91          0.95           0.92
F-measure                     0.97           0.95          0.91           0.90
Completion time               34 s           5 min 32 s    3 min 38 s     45 min 39 s

High-risk Vehicle Identification
[Figure: annotated satellite image (© 2006 Google, Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ). Labels: Port of Entry (check points), vehicle lanes, turn-out points.]

A Vehicle Pair Identified by MI
[Figure: Tucson metropolitan area narcotics network, combining Customs and Border Protection data with the Pima County criminal network; mutual information (MI) plotted by crossing date for Vehicle A and Vehicle B, "Frequent Crossers at Night"; Vehicle C and Vehicle D also shown.]

A Vehicle to Watch?
• Shape indicates object type: circles are people, rectangles are vehicles
• Color denotes activity history: gang related; violent crimes; narcotics crimes; violent & narcotics
• Larger size indicates higher levels of activity
• Border crossing plates are outlined in red

COPLINK project in the press
• The New York Times, November 2, 2002: COPLINK assisted in DC sniper investigation
• ABC News, April 15, 2003: Google for Cops: Coplink software helps police search for cyber clues to bust criminals
• Newsweek Magazine, March 3, 2003: A computerized way for police to coordinate crime databases
• Washington Post, March 6, 2008: COPLINK in use in 3,500 police agencies in US!
• COPLINK merged with i2 (Silver Lake) in 2009; i2/COPLINK acquired by IBM in 2011 for $500M

COPLINK R&D Summary
• COPLINK research: data warehousing, information access, information sharing, association rule mining, mobile alert, spatiotemporal visualization, deception detection, border protection, criminal/dark network analysis and visualization
• COPLINK publications and graduates: 25 journal papers (MIS, ACM, IEEE); 30 conference articles and chapters; 6 Ph.D. students, 40 MS students, 10 BS students
• COPLINK federal funding ($4M): NIJ/DOJ (1997-2000), BJA/TPD (2000-2003), NSF DLI (2003-2007)
• COPLINK commercialization: UA technology transfer and KCC founding (1999); venture funding ($4.6M, 2000 & 2003); customer sales ($30M); Silverlake/I2/IBM acquisition (2009, 2011; $420M)
• COPLINK impacts: 3,500 US agencies; top-ten police agencies; NATO; case closure and investigation efficiency (10-fold improvement)

Pain, Sorrow, and Regret
• Loss of family time/life (but never money)
• Managing university obligations and COI
• University bureaucracy, Office of Technology Transfer (OPTT)
• Lawyers, accountants are expensive
• Chasing angels/VCs (40 frogs, 1 prince)
• Office, employees, products
• Selling products (becoming a vendor)
• Burning cash
• Bubble burst
• Raising second round funding when you are down ($2M)
• Board room yelling matches
• University accusations
• Losing control and shares
• Anti-dilution clause (losing $60M for the $2M you never used)

DARK WEB
SOCIAL MEDIA ANALYTICS, DEEP WEB SPIDERING, WEB LINK ANALYSIS, WEB METRICS ANALYSIS, MULTILINGUAL AFFECT ANALYSIS, AUTHORSHIP ANALYSIS, MULTIMEDIA ANALYSIS, TEXT VISUALIZATION, DYNAMIC SNA, SIR MODELING, DARK WEB PORTAL, GEOPOLITICAL WEB PORTAL

Dark Web Overview
• Dark Web: Terrorists’ and cyber criminals’ use of the Internet
• Collection: Web sites, forums, blogs, YouTube, Second Life
• Analysis and Visualization: Link and content analysis; Web metrics analysis; Authorship analysis; Sentiment analysis; Multimedia analysis
• Our collection is about 20 TB in size, with close to 10B pages/files/messages from more than 10,000 Dark Web sites.

Dark Web project in the press
• Project Seeks to Track Terror Web Posts, 11/11/2007
• Researchers say tool could trace online posts to terrorists, 11/11/2007
• Mathematicians Work to Help Track Terrorist Activity, 9/14/2007
• Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007

Dark Web, Springer, 2012
• 22 chapters, 451 pages, 150 illustrations (81 in color); Springer Integrated Series in Information Systems, 2012.
• Selected TOC:
  • Forum Spidering
  • Link and Content Analysis
  • Dark Network Analysis
  • Interactional Coherence Analysis
  • Dark Web Attribution System
  • Authorship Analysis
  • Sentiment Analysis
  • Affect Analysis
  • CyberGate Visualization
  • Dark Web Forum Portal
  • Case Studies: Jihadi Video Analysis, Extremist YouTube Videos, IEDs, WMDs, Women’s Forums

ALGORITHMS

Dark Web Forum Crawler System: Probing the Hidden Web

CyberGate for Social Media Analytics: Ideational, Textual and Interpersonal Information

System Design: CyberGate
Language Features

Resource   Category     Feature Group         Quantity   Examples
Language   Lexical      Word Length           20         word frequency distribution
                        Letters               26         A, B, C
                        Special Characters    21         $, @, #, *, &
                        Digits                10         0, 1, 2
           Syntactic    Function Words        250        of, for, the, on, if
                        Pronouns              20         I, he, we, us, them
                        Conjunctions          30         and, or, although
                        Prepositions          30         at, from, onto, with
                        Punctuation           8          !, ?, :, ”
           Structural   Document Structure    14         has greeting, has url, requoted content
                        Technical Structure   50         file extensions, fonts, images
           Lexicons     Sentiment Lexicons    3000       positive, negative terms
                        Affect Lexicons       5000       happiness, anger, hate, excitement
Process    Lexical      Word-Level Lexical    8          % char per word
                        Char-Level Lexical    7          % numeric char per message
                        Vocabulary Richness   8          hapax legomena, Yule's K
           Syntactic    POS Tags              2200       NP_VB
           Content-     Noun Phrases          Varies     account, bonds, stocks
           Based        Named Entities        Varies     Enron, Cisco, El Paso, California
                        Bag-of-words          Varies     all words except function words
           N-Grams      Character-Level       Varies     aa, ab, aaa, aab
                        Word-Level            Varies     went to, to the, went to the
                        POS-Level             Varies     NNP_VB VB, VB ADJ
                        Digit-Level           1100       12, 94, 192

Arabic Writeprint Feature Set: Online Authorship Analysis
[Figure: feature tree. Feature Set (418) = Lexical (79) + Syntactic (262) + Structural (62) + Content-Specific (15). Feature groups shown: Char-Based (Char-Level, Letter Frequency, Special Char.), Word-Based (Word-Level, Vocab. Richness, Word Length Dist., Elongation), Function Words, Punctuation, Word Roots, Word Structure, Message Level, Paragraph Level, Contact Information, Technical Structure (Font Color, Font Size, Embedded Images, Hyperlinks), Race/Nationality, Violence.]

Arabic Feature Extraction Component
[Figure: extraction pipeline. (1) Incoming Message; (2) Elongation Filter (Count +1, Degree +5), producing the Filtered Message; (3) Root Clustering Algorithm with Root Dictionary, Similarity Scores (SC), max(SC)+1; (4) Generic Feature Extractor for all remaining feature values; output: Feature Set.]

System Design: Writeprints
Writeprint Technique Steps
1) Derive two primary eigenvectors (ones with the largest eigenvalues) from the feature usage matrix.
2) Extract feature vectors for a sliding window instance.
3) Compute window instance coordinates by multiplying window feature vectors with the two eigenvectors.
4) Plot window instance points in two-dimensional space.
5) Repeat steps 2-4 for each window.

Evaluation: Writeprints
Style Classification Results
Writeprints outperformed SVM by 8%-10% for both experimental settings. The improved performance was statistically significant for 25 and 50 authors. Furthermore, the Writeprint accuracies for such a large number of authors are higher than those in previous studies (Zheng et al., 2006).
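
The five Writeprint steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CyberGate implementation: the five character-ratio features, the window size and step, the mean-centering before projection, and the use of power iteration with deflation to find the two leading eigenvectors are all choices made here for brevity. Every function name is hypothetical.

```python
import math

def char_features(window):
    """Toy stylistic features for one window (hypothetical feature set)."""
    n = max(len(window), 1)
    return [
        sum(c in "aeiouAEIOU" for c in window) / n,  # vowel ratio
        sum(c.isdigit() for c in window) / n,         # digit ratio
        sum(c.isspace() for c in window) / n,         # whitespace ratio
        sum(c in ".,;:!?" for c in window) / n,       # punctuation ratio
        sum(c.isupper() for c in window) / n,         # uppercase ratio
    ]

def sliding_windows(text, size, step):
    """Step 2: cut the text into overlapping character windows."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]

def covariance(rows):
    """Covariance of the feature usage matrix (rows = windows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / max(n - 1, 1)
            for j in range(d)] for i in range(d)]
    return cov, means

def dominant_eigenpair(m, iters=500):
    """Largest eigenvalue/eigenvector of a symmetric matrix by power iteration."""
    d = len(m)
    v = [float(i + 1) for i in range(d)]          # fixed, non-uniform start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                            # zero matrix: nothing to find
            break
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v    # Rayleigh quotient, vector

def top_two_eigenvectors(m):
    """Step 1: the two eigenvectors with the largest eigenvalues."""
    l1, v1 = dominant_eigenpair(m)
    d = len(m)
    deflated = [[m[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    l2, v2 = dominant_eigenpair(deflated)
    return (l1, v1), (l2, v2)

def writeprint_points(text, size=40, step=10):
    """Steps 2-5: project every window onto the two eigenvectors."""
    vectors = [char_features(w) for w in sliding_windows(text, size, step)]
    cov, means = covariance(vectors)
    (_, e1), (_, e2) = top_two_eigenvectors(cov)
    points = []
    for vec in vectors:
        centered = [x - mu for x, mu in zip(vec, means)]
        points.append((sum(a * b for a, b in zip(centered, e1)),
                       sum(a * b for a, b in zip(centered, e2))))
    return points
```

Calling `writeprint_points(some_long_text)` returns one (x, y) pair per window; plotting those pairs (step 4) gives the two-dimensional pattern that the slides call a writeprint.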

Accuracy (%) by technique and number of authors

# Authors     SVM      Writeprints   Baseline
25 Authors    84.00    92.00         62.00
50 Authors    80.00    90.00         51.00

Author Writeprints
[Figure: writeprints for anonymous messages compared with Author A (10 messages) and Author B (10 messages)]

System Design: Ink Blots
Ink Blot Technique Steps
1) Separate input text into two classes (one for the class of interest, one containing all remaining texts).
2) Extract feature vectors for messages.
3) Input vectors into DTM as a binary class problem.
4) For each feature in the computed decision tree, determine blot size and color based on DTM weight and feature usage.
5) Overlay feature blots onto their respective occurrences in the text.
6) Repeat steps 1-5 for each class.

Evaluation: Ink Blots
Topic Categorization Results
Both techniques achieved accuracy over 90% in all instances. SVM significantly outperformed the Ink Blot technique for the 5- and 10-topic experiment settings. The higher performance of SVM was attributable to its ability to better classify the small percentage of messages that were in the gray area between topics.

Accuracy (%) by technique and number of topics

# Topics     SVM      Ink Blots   Baseline
5 Topics     95.70    92.25       88.75
10 Topics    93.25    90.10       86.55

CyberGate

Dark Web Forum Participant Network Analysis
[Figure: analysis pipeline. Stages: Data Acquisition (Dark Web forum thread pages); Parsing (parsed forum data); Social Network Extraction (time spell construction); Time-dependent Feature Extraction; Time-series Analysis (ARX model).]
• Content-based features: lexical features
• Graph-based features: degree centrality, importance
• Individual-based features: user characteristics, user behaviors
• Explanatory variables: avg. length of postings; Freeman betweenness / PageRank score / HITS score; in degree / out degree; number of postings; avg. number of postings per thread; avg. length of threads; posting violence level (t-1)
• Dependent variable: posting violence level (t)
• A user's violence level is very stable and is hard for other users to influence.
• Users who spend more time in the Dark Web forum become more violent in their discussions.
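
To make the ARX (autoregressive with exogenous input) step concrete, here is a minimal sketch that fits posting violence level at time t from its own value at t-1 plus one explanatory variable. The first-order form y[t] = a*y[t-1] + b*x[t] + c, the single exogenous series, the ordinary least-squares fit, and all function names are assumptions made for illustration; the slides do not state the model order or the estimation method, and the data below are synthetic.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                     # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_arx(y, x):
    """Least-squares fit of y[t] = a*y[t-1] + b*x[t] + c; returns [a, b, c]."""
    rows = [[y[t - 1], x[t], 1.0] for t in range(1, len(y))]
    targets = [y[t] for t in range(1, len(y))]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)                             # normal equations

def simulate(a, b, c, x, y0):
    """Generate a noise-free series from known coefficients (synthetic data)."""
    y = [y0]
    for t in range(1, len(x)):
        y.append(a * y[-1] + b * x[t] + c)
    return y

# Synthetic example: x could stand for a user's number of postings per period.
x_series = [float((t * 3) % 7) for t in range(30)]
y_series = simulate(0.8, 0.5, 0.1, x_series, y0=1.0)
a_hat, b_hat, c_hat = fit_arx(y_series, x_series)
```

In this setup, a coefficient `a_hat` close to 1 with a small `b_hat` would correspond to the kind of stability the slide reports, where a user's past violence level explains most of the current level.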

SIR Infection Model for Dark Web Forums

\[
\begin{aligned}
s(t) &= -\alpha\, S(t)\, I(t) \\
i(t) &= \alpha\, S(t)\, I(t) - \beta\, I(t) \\
r(t) &= \beta\, I(t) \\
\frac{dS}{dt} &= s(t), \qquad \frac{dI}{dt} = i(t), \qquad \frac{dR}{dt} = r(t) \quad \text{at time } t
\end{aligned}
\]

• S(t): the number of susceptible authors at time t
• I(t): the number of infective authors at time t
• R(t): the number of recovered authors at time t
[Figure: fitted curves for violent topics. Suicide Bomb, R-square = 0.7036; infection rate α = 0.0002; β = 0.03]

SYSTEMS

AZ Forum Spider

Collection – AZ Forum Spider
• Automated collection of forum communications; weekly update
• Proxy servers and parameters
• Site map, URL ordering, and forum extraction
• Incremental spider
• Collection visualization
[Screens: Forum List, Spidering Status, Collection Statistics, Spidering Profile]

Analysis – AZ CyberGate Text Analyzer
• Comprehensive system for the analysis and visualization of forum communications
• Shows all text features
• Utilizes Writeprint and Ink Blot techniques in text analysis
• Incorporates rich visualization based upon multidimensional scaling and parallel coordinates

Authorship Heatmap

Authorship Comparison Radar Chart

AZ Forum Portal

Dark Web Forum Portal
• Current version: 13M messages (340K members) across 29 major Jihadi forums in English, Arabic, French, German and Russian (VBulletin)
• Forum analysis: by forum, thread, member, time period, or topic
• Social network analysis and visualization
• Google Translation

Dark Web Video Portal
• Video-sharing websites have also been found to be utilized by extremist groups. Example: preparing explosives
• However, most video-sharing websites lack an automatic approach to identify illegal, offensive, and terrorism-/extremism-related videos (Dark Videos) in their huge video collections.
• YouTube only provides the “flag” mechanism for users to mark inappropriate videos.
• In addition, identifying and collecting Dark Videos is also important for the Dark Web research community.

Video Portal System Functionalities
• Video statistics analysis: video statistics, such as top video authors and trends of videos uploaded per day, are displayed in 2D graphs.
[Screens: basic statistics of a collection; trend of video comments; top video authors]

GeoPolitical Web: Predicting Arab Spring? (cyber real world)

Region/Country                                             Language (in order of importance)
Afghanistan                                                Dari Persian, Pashto, English
Indonesia                                                  Indonesian, English
Iraq                                                       Arabic, Kurdish, South Azeri (“Turkmen”)
Maghreb (Algeria, Libya, Mauritania, Morocco, Tunisia)     Arabic, French, English
Somalia                                                    Arabic, Somali, English
Yemen                                                      Arabic, English

• AFRICAN COUNTRIES: Somalia and Maghreb region (Morocco, Algeria, Tunisia, Libya, Mauritania)
• MIDDLE EAST: Yemen, Afghanistan, Iraq
• SOUTHEAST ASIA/OCEANIA: Indonesia

GeoPolitical Web System Design
[Figure: layered system design]
• Information Categories: Economic Information, Political Information, Cultural Information
• Data Collection: data sources include social media (forum, blog, Twitter), mass media (news), World Bank, IMF, UN, Economist; collected by spider or manually
• Representation/Integration: Sentiment, Topic, Time Series, Social Network; Economic Metrics, Political Metrics, Cultural Metrics
• Analytic Approaches: Predicting Geopolitical Risks
• Visualization: Interactive Applications, Static Figures/Dashboards

GeoPolitical Web Data Collection Summary
Social Media: Forums
• Scope: wide discussion on universal topics; postings are organized by threads (subjects)
• Collection method: automated (crawlers)
• Quantity:
  • Have collected 70 forums to date, with 26M messages from 3.3M threads
  • Database with parsed forum text content is currently over 30 GB
  • Collection of raw forum HTML files spans multiple terabytes
  • Additional forums have been identified and are soon to be collected
• 70 forums in 14 countries: Yemen – 10; Iraq – 8; Somalia – 7; Afghanistan – 4; Indonesia – 4; Algeria – 4; Egypt – 5; Jordan – 5; Lebanon – 4; Morocco – 4; Pakistan – 6; Saudi Arabia – 5; Tunisia – 5
• Time span (earliest/latest, in listed order): 10/02 – 05/12; 09/02 – 06/12; 01/01 – 05/12; 07/02 – 05/12; 02/00 – 06/12; 11/05 – 05/12; 01/05 – 05/12; 02/00 – 05/12; 05/08 – 06/12; 06/06 – 06/12; 07/04 – 05/12; 04/03 – 06/12; 06/01 – 06/12; 01/05 – 05/12
• Languages (6): English, Arabic, French, Indonesian, Pashto, Urdu

Select Forums by Country
• Choose Browse Forums from the main page.
• Click the country of interest to see forums; here, Algeria is being selected.
• Descriptive information listed for each forum includes:
  • Forum name
  • Predominant language
  • Numbers of threads and messages
  • Forum start and end dates
  • Forum URL

Browse Forums by Thread
• Browse threads in the original language, with the threads organized by number of posts.
• The threads are translated into English via Google Translate.

Search Using Quick Search
• Type search terms into the Quick Search box; terms are automatically translated to all supported languages and searched.
• Matching threads are returned in ranked relevance order, grouped by language.

HACKER WEB
HACKER COMMUNITY EXPLORATION, FORUM COLLECTION, IRC CHANNELS, BOTNETS C&C, HONEYPOTS, SOCIAL MEDIA ANALYTICS, MALWARE ANALYSIS & ATTRIBUTION

Hacker Web Overview (NSF SaTC, SFS, PI: Chen, Goes)
• Secure and Trustworthy Cyberspace (SaTC), $1.2M: cybercrime attribution
• Scholarship for Service (SFS), $2.7M: UA/MIS MS NSA-CAE Cyber Security Certificate

Hacker Web System: Collection & Analytics

Hacker Reputation Attribution

Community Name   Language   # of Messages   # of Users   Forum Start Date
Hackhound.org    English    77,061          5,794        October 9, 2008
Unpack.cn        Chinese    646,494         22,743       October 12, 2004

• Both communities let hackers attach hacking tools and program source code to their messages for others to use.
• Additionally, both communities allow hackers to assign each other a reputation score in order to rate one another’s usefulness and trustworthiness.

Research Testbed
[Figure: annotated forum screenshots from Hackhound.org and Unpack.cn. Labels: hacking tool interface; description of code functionality; hacker’s reputation score; attached hacking tool; embedded sample of code.]
• Left: A cybercriminal on hackhound.org publishes the latest version of his hacking tool meant to help others steal cached passwords on victims’ computers.
• Right: A hacker in the Chinese community Unpack.cn posts sample code demonstrating how to reverse engineer software written in the Microsoft .NET framework.

Research Design
1. Hacker Community Collection (WWW)
2. Content Extraction
3. Feature Extraction
   3a. Average Message Length Calculation
   3b. Thread Response Frequency Calculation
   3c. Thread Involvement Calculation
   3d. User Tenure Calculation
   3e. Total Message Attachment Calculation
   3f. Total Message Volume Calculation
   3g. Hacker Reputation Calculation
4. Regression Analysis

Results: Threads, Attachments, Total Messages

Hackhound.org                   Estimate   Std. Error   T value
Average_Message_Length          -0.0083    0.0025       -0.968
Number_Of_Replies_Per_Thread    0.0188     0.0616       0.305
Number_Of_Threads_Involved      0.1689     0.0538       2.822 **
Tenure                          0.0041     0.0123       0.526
Sum_Of_Attachments              0.2786     0.1437       5.323 ***
Total_Messages                  0.3396     0.0379       6.554 ***

Unpack.cn                       Estimate   Std. Error   T value
Average_Message_Length          0.0052     0.0027       0.125
Number_Of_Replies_Per_Thread    0.0372     0.0040       0.528
Number_Of_Threads_Involved      0.1403     0.0033       1.914 *
Tenure                          -0.0086    0.0135       -0.144
Sum_Of_Attachments              0.3805     0.1991       4.757 ***
Total_Messages                  0.2838     0.0252       3.714 **

*** p ≤ 0.001   ** p ≤ 0.01   * p ≤ 0.05

ShadowServer and Botnets

Attribution System Overview

System Design - Criminal Clustering
• Within the IRC dataset, about 4,000 human nicknames were identified hiding among about 3,600 IRC C&C channels.
• Criminals are found associated with specific C&C channels, and these linkages form a bipartite graph.
• Criminals in the botnet underworld are not lone entities. They may collaborate with others and form alliances.
• C&C channels are not isolated crimes. Many C&C servers may be operated by the same individuals or groups.
• Most gangs maintain several C&C channels:
  • For very large botnets, this distributes the communications load.
  • It provides redundancy: should law enforcement take down a single C&C server, not all of the drone army is lost.
• The goal is to cluster collaborating individuals into groups of criminal gangs and C&C assets (sub-bigraphs).
• Crimes detected in individual C&C channels can then be considered in aggregate across an entire criminal syndicate.

The Criminal Network

Sample Criminal Clusters
[Table flattened in extraction. Columns: Gang Members; # C&C Channels; Bot Population; # DDoS Targets; # Pstore Thefts. Row alignment between nickname lists and counts is not recoverable; both are given in their extracted order.]
• Gang member nicknames:
  • [0]USA--2KSP3[Om]824584 creature edzy fri frioz wejbwfe wloo BlaCkD3v—L
  • bill gu3sT Besi D-PaLo hidden load process tonii Albania DaddyCooL[a] jelo jeloo [KleviS] Opium Silv3rArRoW waleed
  • ILGuardiano liga MArian0z PepP0z JuMp xRaZoRx xBreaKx xxDCxx vDCv xGoDx xCKx xBeNx xBrandoNx xAmplifyx xSKYx xToaDx xTiMx Max hans matrix toxic abc Peter bob home Andy dan Jack blbla billy mark xxx sss mr StRuGaNi007 bostss Heropos niggaz yeste Pacino NhG Ld fada pilz AsC [a] bAcaRdI dRiVeR alejandro mut hook Dritton ArditS Corrupted
  • Attacker hh
• Counts: 51 235713 1263; 88 30140 3310; 44 252969 730; 56 256193 6698; 15 6094 2350; 15 303286 2615; 32 220703 730; 12 239708 479; 1988 4484

For more information
hchen@eller.Arizona.edu
http://ai.Arizona.edu
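
As an illustration of the criminal clustering idea (nicknames linked to C&C channels form a bipartite graph, and collaborating groups are sub-bigraphs), the sketch below groups nicknames and channels by connected components. Connected components are a deliberately simple stand-in chosen here; the slides do not say which clustering algorithm the attribution system uses. The nicknames and channel names in the example are hypothetical.

```python
from collections import defaultdict, deque

def cluster_bipartite(edges):
    """Group (nickname, channel) links into connected sub-bigraphs.

    Returns a list of (sorted nicknames, sorted channels) pairs.
    """
    adj = defaultdict(set)
    for nick, chan in edges:
        # Tag each node with its side so a nickname and a channel
        # that share the same string never collide.
        adj[("nick", nick)].add(("chan", chan))
        adj[("chan", chan)].add(("nick", nick))

    seen, clusters = set(), []
    for start in sorted(adj):                # deterministic visiting order
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        nicks, chans = set(), set()
        while queue:                         # breadth-first search
            kind, name = queue.popleft()
            (nicks if kind == "nick" else chans).add(name)
            for neighbor in adj[(kind, name)]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append((sorted(nicks), sorted(chans)))
    return clusters

# Hypothetical observations: which nickname was seen in which C&C channel.
links = [
    ("nick_a", "#chan1"),
    ("nick_b", "#chan1"),
    ("nick_b", "#chan2"),
    ("nick_c", "#chan3"),
]
gangs = cluster_bipartite(links)
# gangs == [(["nick_a", "nick_b"], ["#chan1", "#chan2"]), (["nick_c"], ["#chan3"])]
```

Once nicknames and channels are grouped this way, per-channel counts such as bot population or DDoS targets can be summed over each cluster, which is the kind of aggregate view the sample cluster table presents.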