Macro Trends in Counter-Terrorism Technologies
And Thoughts on Responsible Innovation
DETECTER Project, Brussels
September 7th, 2011
Jeff Jonas, IBM Distinguished Engineer
Chief Scientist, IBM Entity Analytics
JeffJonas@us.ibm.com

Today’s Material
• Background
• Macro Trends
• Detecting Bad Guys in Big Data
• Challenging Privacy and Civil Liberties Issues
• Privacy by Design (PbD) Considerations
• Questions and Answers

Background
• Early 1980s: Founded Systems Research & Development (SRD), a custom software consultancy
• 1989 – 2003: Built numerous systems for Las Vegas casinos, including a technology known as Non-Obvious Relationship Awareness (NORA)
• 2001/2003: Funded by In-Q-Tel
• 2005: SRD acquired by IBM; now Chief Scientist, IBM Entity Analytics
• Cumulatively: I have had a hand in a number of systems with multiple billions of rows describing hundreds of millions of entities

Roles
• Member, Markle Foundation Task Force on National Security in the Information Age
• Board Member, US Geospatial Intelligence Foundation (USGIF), the GEOINT organizing body
• Senior Associate, Center for Strategic and International Studies (CSIS)
• Member, EPIC advisory board
• Advisor, Privacy International

Current Primary Area of Interest
Making sense of information in large data sets, across complex ecosystems, with emphasis on privacy and civil liberties protections
  – 1996: Created an identity-centric customer repository based on 4,200 disparate systems … >100 million resolved identities
  – 2001: Assisted in various post-9/11 data analysis programs for the public and private sectors
  – 2005: Missing persons project following Hurricane Katrina, resulting in the reunification of >100 loved ones

A Late Bloomer to Privacy
• 1980 – 2001: No clue whatsoever
• 2001 – 2006: Slowly waking up
• 2007 – 2011: Today, at best, a student of privacy

A Journey Fraught with Reflection and Rethinking
The greater my privacy and civil liberties awareness, the greater the number of imperfections that appear in my rearview mirror.

Katrina – Missing Persons Reunification Project
• Information about the status of persons quickly ends up scattered across countless databases
  – Over 50 such web sites/organizations were identified as having victim-related data
  – Many people were registered multiple times in the same database
  – Many people were registered multiple times across databases
  – Many people were registered as missing in one database and found in another database
• Connecting found persons previously reported as missing becomes nearly impossible
  – Too many databases
  – Constantly changing data

Katrina Reunification Project Statistics
  Total data sources          15
  Usable records              1,570,000
  Unique persons              36,815
  Total loved ones reunited   >100

Katrina – Missing Persons Reunification Project
Privacy by Design (PbD)
  – Contractually authorized to delete all the data after the reunification office completed its work
  – Hence, a few months later, all collected data and reporting products were deleted
DESTRUCTION OF EVIDENCE!
Data Decommissioning – Destruction of Accountability

Macro Trends

Good News: The World is Not More Dangerous
Avg Age
  1900: Western Europe      37
  Today: Global Average     67
Number Dead
  1300’s: “Black Death”                                      75M     ~17+%
  Today: If America sank into the ocean and everyone died    300M    ~4.5%

Prediction
Your doctor is 102 and this is not weird.

Bad News: “More Death Cheaper in Future” Graph
[Graph: axes “Complexity of Execution” and “Death”; plotted points: 10 Kiloton Nuke, 1918 Spanish Influenza]

1918 Spanish Influenza Genome

“More Death Cheaper in Future” Graph = Bad
[Graph: axes “Complexity of Execution” (toward “Easier”) and “Death” (toward “More Death”); plotted points: 10 Kiloton Nuke, 1918 Spanish Influenza]

Jerome Kerviel – US$7B
www.chinapost.com.tw/news_images/20080127/p1d.jpg

Jerome Kerviel – US$7B
[Diagram labels: Back it out, Back it in, Back it out, Back it in; Analytic Checkpoint (twice); 1 Day]

2050 Predictions
A single person can kill 100M people for <$1,000.

State of the Union: Enterprise Amnesia

Amnesia, definition
A defect in memory, especially resulting from brain damage.

US National Security Amnesia Events
• 9/11: Two known terrorists were admitted into the US (only discovered after the fact).
• Christmas Day Bomber: Abdulmutallab possessed a multi-entry visa while at the same time being on the terrorist watch list (only discovered after the fact).

Trend: Organizations Are Getting Dumber
“Every two days now we create as much information as we did from the dawn of civilization up until 2003.” ~ Eric Schmidt, CEO Google
[Graph: axes “Computing Power Growth” and “Time”; curves “Available Observation Space” and “Sensemaking Algorithms”; labels “Context” and “Enterprise Amnesia”]

Trend: Organizations Are Getting Dumber
[Graph: axes “Computing Power Growth” and “Time”; curves “Available Observation Space” and “Sensemaking Algorithms”; label “Context”; annotated “WHY?”]

Algorithms at Dead End. You Can’t Squeeze Knowledge Out of a Pixel.

No Context
scrila34@msn.com

Context, definition
Better understanding something by taking into account the things around it.

Information without context is hardly actionable.

Lack of Context – Consequences
• Alert queues grow faster than humans can address them – filled mostly with false positives
• The top item in the queue is not the most relevant item
• Items require so much investigative effort that they are often abandoned prematurely
• Risk assessment becomes the risk

Information in Context … and Accumulating
[Figure labels: scrila34@msn.com, Job Applicant, No Fly List, Most Trusted Source, Known Terrorist]

The Puzzle Metaphor
• Imagine an ever-growing pile of puzzle pieces of varying sizes, shapes and colors
• What it represents is unknown – there is no picture
• Is it one puzzle, 15 puzzles, or 1,500 different puzzles?
• Some pieces are duplicates, missing, incomplete, low quality, or have been misinterpreted
• Some pieces may even be professionally fabricated lies
• Until you take the pieces to the table and attempt assembly, you don’t know what you are dealing with

Puzzling: 4 Puzzles, 620 Useful Pieces
  270 pieces    90%
  200 pieces    66%
  150 pieces    50%
  30 pieces     10%   (duplicates)
  6 pieces      2%    (pure noise)
  +36 Useless Pieces!

First Discovery

More Data Finds Data

Duplicates in Front Of Your Eyes

First Duplicate Found Here

Incremental Context – Incremental Discovery
  6:40pm   START
  22min    “Hey, this one is a duplicate!”
  35min    “I think some pieces are missing.”
  37min    “Looks like a bunch of hillbillies on a porch.”
  44min    “Hillbillies, playing guitars, sitting on a porch, near a barber sign … and a banjo!”
[Photo label: 150 pieces, 50%]

Incremental Context – Incremental Discovery
  47min    “We should take the sky and grass off the table.”
  2hr      “Let’s switch sides, and see if we can make sense of this from different perspectives.”
  2hr10m   “Wait, there are three … no, four puzzles.”
  2hr17m   “We need a bigger table.”
  2hr18m   “I think you threw in a few random pieces.”

Trend: Big Data [in context] = New Physics
• More data: better predictions
  – Lower false positives
  – Lower false negatives
• More data: bad data … good
  – Suddenly glad your data was not perfect
• More data: less compute

From Pixels to Pictures to Insight
[Diagram labels: Observations, Contextualization, Persistent Context, Relevance Detection, Consumer (an analyst, a system, the sensor itself, etc.)]

One Form of Context is “Expert Counting”
• Is it 5 people each with 1 account … or is it 1 person with 5 accounts?
• Is it 20 cases of H1N1 in 20 cities … or one case reported 20 times?
• If one cannot count … one cannot estimate vector or velocity (direction and speed).
• Without vector and velocity … prediction is nearly impossible.
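
The counting question above (5 people with 1 account each, or 1 person with 5 accounts?) can be illustrated with a minimal sketch. It is not the method used in any system named in this talk: it assumes a hypothetical record schema (a dict of feature name to value), made-up sample values, and the simplifying rule that two records sharing any exact feature value belong to the same entity. Real entity resolution also has to handle fuzzy matches, incompatible features and deliberate deceit.

```python
def count_entities(records):
    """Count distinct entities among records.

    Assumption (illustrative only): two records describe the same entity
    when they share at least one exact (feature, value) pair. Records are
    grouped transitively with a small union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (feature, value) -> index of first record carrying it
    for idx, record in enumerate(records):
        for feature in record.items():
            if feature in first_seen:
                union(first_seen[feature], idx)
            else:
                first_seen[feature] = idx

    return len({find(i) for i in range(len(records))})


# Hypothetical sample data: five accounts linked by shared phone, email, address.
accounts = [
    {"account": "A1", "phone": "555-0100"},
    {"account": "A2", "phone": "555-0100", "email": "x@example.com"},
    {"account": "A3", "email": "x@example.com"},
    {"account": "A4", "address": "1 Example Road"},
    {"account": "A5", "address": "1 Example Road", "email": "x@example.com"},
]

print(len(accounts), "accounts,", count_entities(accounts), "entity")
```

Each record alone looks like a separate person; only after the shared features accumulate does the count collapse from five to one, which is the point of counting in context.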

Skilled adversaries engage in “channel separation.”
[Figure labels: Cell Phone #1, Cell Phone #2, Bank Acct #1, Passport #1; Unknown, Unknown, Billy K., William A.]

Hence, detection requires “channel consolidation.”
William A. aka Billy K.
• Cell Phone #1
• Cell Phone #2
• Bank Acct #1
• Passport #1

Expert Counting: Degrees of Difficulty
[Figure: record pairs at increasing degrees of difficulty: Exactly Same, Fuzzy, Incompatible Features, Deceit. Example records: Bob Jones 123455; Robert T Jonnes 000123455; bjones@hotmail; Ken Wells 550119]

Deceit Detection Using Context Accumulation
[Figure: feature accumulation. Records shown: Bob Jones 123455; Ken Wells 550119; Robert Jones 123455, POB 13452, DOB 03/12/73; Bob Jones, POB 13452, gw3e56@hotmail.com; Ken Wells 550119, POB 999911, DOB 03/12/73, gw3e56@hotmail.com. Outcome: Robert Jones 123455 and Ken Wells 550119, “Resolved!”]

3 Models for Information Sharing

1. Bulk Transfer
• Large collections are passed along to appropriate third parties
• May be required if the recipient must commingle the data in secret
• The recipients must have a capacity much larger than their own native requirements
• The more copies, the more difficult it is to maintain information currency across the ecosystem
• The more copies, the more difficult it is to prevent unintended disclosure
• Useful when the number of recipients and transactional volumes are very small

2. Services for Inquiry
• Owners enable third-party inquiry (human or machine lookups)
• When lots of systems are integrated, federated search can be automated to search all third-party data sources based on a single user/machine search
• Each system in the federation must be sized for all volume
• Third-party systems often lack the necessary indexes
• Nearly impossible to ensure each federated system is on-line
• Useful for periodic, on-demand inquiry using each third-party data source like a reference system – particularly appropriate for narrow investigative work and/or forensic analysis
• Not that useful for detect/preempt missions

3. Central Catalog/Index
• Parties interested in information sharing supply metadata to a central catalog (index)
• Inquiries can discover the location of all available documents using a single lookup
• Card catalogs provide pointers to source systems and documents, enabling efficient/scalable lookup (aka federated fetch)
• Easier to keep the data current than with bulk transfer
• Scales massively
• Easier to secure

Discovery at the Library
[Figure labels: Subject, Title, Author]

Enterprise Discovery
[Figure labels: Who, What, Where, When, How]

The Policy Focus Becomes … “Discoverability”
If you don’t publish your metadata (who, what, where, when) to the enterprise catalog …
Information is not discoverable …
Therefore, the value of your operational system to the broad strategic interests of the enterprise is effectively ZERO!

Are You Playing Well With Others?
SHARING SCORECARD(*): DISCOVERABILITY
  Organization       Records   Discoverable   %
  This org           5B        2.5B           50%
  That org           120B      6B             5%
  The other org      3B        1B             33%
  Their org          1B        750M           75%
  Their other org    1B        500M           50%
(*) Any resemblance to real organizations and real numbers would be coincidental

Challenging Privacy and Civil Liberties Issues

Issue #1: Essential Secrets vs. Transparency
To detect professionally fabricated lies, using only data, one must either:
  1. Collect observations the adversary doesn’t know you have
  2. Or be able to compute over your observations in a manner the adversary cannot fathom
The Challenge: How can organizations catch bad guys if there is transparency over their observational space and what is computable?

Issue #2: More Data Good
The good news: Those in the counterterrorism business and the privacy community equally detest false positives
  – The government recognizes that false positives waste government resources
  – The privacy community recognizes that false positives place the innocent under undeserved government scrutiny
The challenge: Two remedies for false positives
  1. Change the rules to reduce the number of alerts (which increases the false negatives)
  2. Add more information such that the additional context permits greater discrimination
The more data, the lower the false positives and the lower the false negatives

Issue #3: Necessity of Central Indexes
• Federated search is extremely limited
  – Does not scale when the mission is to get “left of boom” (detection)
• Central card catalogs (indexes) are the only viable way forward
  – Only the metadata is centralized, with pointers, not all the data
The Challenge: General reaction to central databases, even if just an index

Issue #4: Lone Gunmen Surveillance
• Rare events planned by one person or a small group are more difficult to detect
• The size of the observation space needed to detect lone gunmen planning acts of terrorism … approaches ubiquitous surveillance
• Risk-based surveillance
  – A car bomb in a public place
  – A sector of national infrastructure at risk
  – WMD over a major city
The Challenge: At some point when one person can create extraordinary damage, cheaply, without a trace … then what?

Issue #5: Fewer Secrets Lead to Chilling Effects?
• It is becoming harder and harder to have secrets
• Will this chill behavior?
  – Will population behavior gravitate towards the center of the bell curve?
  – Or, will mankind become more tolerant of diversity?

Privacy by Design (PbD) Considerations

Universal Declaration of Human Rights
Article 9
No one shall be subjected to arbitrary arrest, detention or exile.
Article 12
No one shall be subjected to arbitrary interference with his privacy, family, home or correspondence, nor to attacks upon his honor and reputation. Everyone has the right to the protection of the law against such interference or attacks.
Article 15
(1) Everyone has the right to a nationality.
(2) No one shall be arbitrarily deprived of his nationality nor denied the right to change his nationality.
Article 17
(1) Everyone has the right to own property alone as well as in association with others.
(2) No one shall be arbitrarily deprived of his property.

PbD: Information Attribution
• Avoid the receipt of any data that does not come with an ability to track its pedigree/attribution.
• When passing your data into secondary systems, pass the data pedigree/attribution along to the recipient (even if that means only a pointer to your copy).
• If the ‘chain of where data came from’ is not maintained in the information sharing ecosystem, there is no hope of keeping it current, and it is very difficult to reconcile cross-system consistency.
More here:
  Full Attribution, Don’t Leave Home Without It
  Out-bound Record-level Accountability in Information Sharing Systems

PbD: Data Destruction
• When the data is no longer needed or there is a mandate … purge it.
• For example, at the close of a special information analysis project, consider decommissioning the data sets in proportion to the consequences of unintended disclosure or misuse.
• If there is a legal requirement to retain data, or long-term accountability is necessary, consider pushing the data to forms of retrieval useful only in the context of forensic/investigatory purposes.
More here:
  Decommissioning Data: Destruction of Accountability

PbD: Limit Data Transfers
• If you don’t have to move the entire record: don’t.
• Using information sharing systems as an example, it is best not to send all the data to each (and every) information sharing partner. Better to create a central index with prescribed fields. The index then points to the original data holder, and getting access to the original record requires permission at that time from the original data holder. This ensures a degree of transparency.
More here:
  Discoverability: The First Information Sharing Principle

PbD: Data Tethering
• When data is moved from systems of record out into secondary systems, these secondary systems should be notified as the source data changes (adds, changes and deletes).
• If the secondary systems have themselves forwarded the data to tertiary systems, these same changes should be passed through the entire food chain.
More here:
  Data Tethering: Managing the Echo

PbD: Obfuscate Data
• For every copy there is an increasing risk of unintended disclosure.
• When there is an opportunity to perform data masking, anonymization, encryption … do it.
• Techniques now exist whereby data can first be obfuscated (e.g., encrypted, anonymized, masked, etc.) before information transfer ... while still maintaining a capability of performing deep analytics (e.g., data matching) post obfuscation.
More here:
  To Anonymize or Not Anonymize, That is the Question

Maximizing Discovery - Minimizing Disclosure
[Diagram: obfuscated observations flow from sensors into Persistent Context. Employee Database, Record #A-701: Cd5dced41028cb …, 00c9782a552a2 …, 7f2b6e48ea7d0 … Fraud Database, Record #B-9103: 0d06b31faa7c…, B5e341a4b0c…, 00c9782a552… Persistent Context FEATURES: Cd5dced41028cb7ea51, 00c9782a552a2d09b1b, 7f2b6e48ea7d042bbe8]

Maximizing Discovery - Minimizing Disclosure
[Diagram: the same observations in the clear, behind policy controls. Employee Database, Record #A-701: Mark Randy Smith, DOB: 06/07/74, 123 Main Street, 713 731 5577. Fraud Database, Record #B-9103: M. Randal Smith, DOB: 06/07/74, 713 731 5577. Discovery: Record #A-701 Matches Record #B-9103]

PbD: Build Accountability into Systems
• Opt for the use of tamper-resistant audit logs.
• The greater the lack of transparency, the greater the need for immutable logs: mandated or not.
More here:
  Immutable Audit Logs (IAL’s)
  Found: An Immutable Audit Log

Comments on: Data Mining
• Data mining is not bad. There are settings where data mining is very valuable and saves lives
• Predictive Data Mining
  – Limited efficacy without volumes of training data
• Predicate Triage Data
  – Used to organize data sets containing only “subjects of interest”
More here:
  Effective Counter-Terrorism and the Limited Role of Predictive Data Mining
  Data Mining, Predicate Triage and NSA Domestic Surveillance

Data Mining Defined (humorous)
“Torturing the data until it confesses … and if you torture it enough, you can get it to confess to anything.”
ACM SIGKDD Conference, Philadelphia 2006

Comments on: Link Analysis
• Link analysis is very powerful when used in a narrow fashion: inspection of “subjects of interest” outward.
• Predicate-based link analysis: Big social maps are not useful unless one has an entrance point.
• Link analysis: prune early
More here:
  Hunting Bad Guys, Phone Records and a Few Good Dead Men
  Predicate-based Link Analysis: A Post 9/11 Analysis (1+1= 13)
  Sometimes a Big Picture is Worth a 1,000 False Positives

Comments on: Watch Listing and False Positives
• There is a difference between wrongly named and wrongly matched
• Low-fidelity watch lists are the single biggest cause of false positives; solving this ambiguity involves additional data
• Minimize collection, maximize consumer participation and election
• Provide a redress process
More here:
  Precision in TSA’s Terrorist Watch List
  Comments on the TSA No-Fly and Selectee Watch List Process

Closing Thoughts

“The data must find the data … and the relevance must find the user.”

In Closing
• There are going to be more sensors, more data
• This data will be commingled for greater accuracy to serve consumers and protect countries
• What data is collected/observed and when … will be the debate
• Chief privacy principle: Avoid consumer surprise
• If it has been collected, the holder has the obligation to make sense of it
• Organizations must harness data to be smart, efficient, and survive … but how smart do they need to be and do we trust them?
• Hence the tension

Related Papers
• Heritage Foundation: Paul Rosenzweig/Jeff Jonas
  Correcting False Positives: Redress and the Watch List Conundrum
• Cato Institute: Jeff Jonas/Jim Harper
  Effective Counterterrorism and the Limited Role of Predictive Data Mining
• Steptoe & Johnson: Stewart Baker
  Anonymization, Data-Matching and Privacy: A Case Study
• IEEE Security and Privacy: Jeff Jonas
  Threat and Fraud Intelligence: Las Vegas Style
• Giannino Bassetti Foundation: Jeff Ubois
  Transparency, Privacy and Responsibility: An Interview with Jeff Jonas
• Markle Foundation
  Nation At Risk: Policy Makers Need Better Information to Protect the Country

Related Blog Posts
• Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel
• Puzzling: How Observations Are Accumulated Into Context
• When Risk Assessment is the Risk
• Big Data. New Physics.
• The Christmas Day Intelligence Failure – Part II: Jeff Jonas’ Christmas Wish List
• Decommissioning Data: Destruction of Accountability
• Source Attribution, Don’t Leave Home Without It
• Data Tethering: Managing the Echo
• Out-bound Record-level Accountability in Information Sharing Systems
• To Anonymize or Not Anonymize, That is the Question
• Immutable Audit Logs (IAL’s)
• The Information Sharing Paradox
• Discoverability: The First Information Sharing Principle
• When Federated Search Bites
• Using Transparency As A Mask