Sensitive Data In a Wired World
Negative Representations of Data

Stephanie Forrest
Dept. of Computer Science
Univ. of New Mexico
Albuquerque, NM
http://cs.unm.edu/~forrest
forrest@cs.unm.edu

Introduction
• Goal: Develop new approaches to data security and privacy that incorporate design principles from living systems:
  – Survivability and evolvability
  – Autonomy
  – Robustness, adaptation and self repair
  – Diversity
• Extends earlier work on computational properties of the immune system:
  – Intrusion detection
  – Automated response
  – Collaborative information filtering

Project Overview
• Immunology and data:
  – Negative representations of information
• Epidemiology and the Internet:
  – Social networks matter
  – The real world is not always scale free
• The social utility of privacy:
  – Why is privacy an important value in democratic societies?
  – Evolutionary perspective

Collaborations
• Paul Helman and Cris Moore (UNM)
• Robert Axelrod and Mark Newman (Univ. Michigan)
• Matthew Williamson (Sana Security)
• Rebecca Wright and Michael de Mare (Stevens)
• Joan Feigenbaum and Avi Silberschatz (Yale)
  – Fernando Esponda's post-doc next year.

How the Immune System Distributes Detection
• Many small detectors matching nonself (negative detection).
• Each detector matches multiple patterns (generalization).
• Advantages of distributed negative detection:
  – Localized (no communication costs)
  – Scalable and tunable
  – Robust (no single point of failure)
  – Private

Applications to Computing
• Anomaly detectors (earlier work)
• Information filters (earlier work)
• Adaptive queries (future)
• Negative representations (in progress)
  – A positive set DB is a set of fixed-length strings.
  – A negative set NDB represents all the strings not in DB.
  – Intuition: If an adversary obtains a string from NDB, little information is revealed.
Example:
  – U = all possible four-character strings
  – DB = {juan, eric, dave}
  – U-DB = {aaaa, aaab, cris, john, luca, raul, tehj, tosh, …}
  – There are 26^4 - 3 = 456,973 strings in U-DB.

Results
• Can U-DB be represented efficiently, given |U-DB| >> |DB|?
  – YES: There is an algorithm that creates an NDB of size polynomial in DB.
  – Strategy: Compress information using a don't-care symbol (*). Other representations?

  DB     U-DB    NDB
  000    001     01*
  101    010     0*1
  111    011     1*0
         100
         110

• What properties does the representation have?
  – Membership queries are tractable (linear time even without indexing).
    • Other queries and information leakage are future work.
  – Inferring information from a subset of NDB (next slide).
  – Inferring DB from NDB is NP-hard (note: not doing crypto):
    • Currently investigating instance difficulty.
    • Algorithms for increasing instance difficulty.
    • On-line insert/delete algorithms preserve problem difficulty.
    • Collaborations with R. Wright, M. de Mare, and C. Moore.

What information is revealed by queries? (without assuming irreversibility)
• Having access to a subset of NDB (or DB) yields some information about strings outside that subset:
  – Assume NDB (or DB) is partitioned into n subsets.
• To the query "Is x in DB?", what do I learn about x if x is not in my subset?
  – Must consult n subsets of NDB to conclude that x is in DB.
  – Must consult the subsets only until x is found (on average n/2).
  – Assumes that we care more about DB than U-DB.
[Figure: probability and information content as the membership of strings is revealed. DB contains 10% of all possible L-length strings. Formulas are in the supplementary material.]

Private Set Intersection
• Determine which records are in the intersection of several databases, i.e.
  – DB1 ∩ DB2 ∩ … ∩ DBn
  – U - (NDB1 ∪ NDB2 ∪ … ∪ NDBn)
• Each party may compute the intersection
  – DBi - (NDB1 ∪ NDB2 ∪ … ∪ NDBn)
• Party i learns only the intersection of all the sets,
• and not the cardinality of the other sets.

Results cont.
• How might these properties be useful?
  – Protect data from insider attacks
  – Computing set intersections
  – Surveys involving sensitive information
  – Anonymous digital credentials
  – Fingerprint databases
  – Other ideas?
• Prototype implementations:
  – Perl, C
  – http://esa.ackleyshack.com/ndb
  – See demo

Computer Epidemiology
Justin Balthrop, Mark Newman, Matt Williamson
[Figure: degree distributions (x-axis: degree k) for four networks: IP network, administrator network, email traffic, address books. Source: Science 304:527-529 (2004).]
• Information spreads over networks of social contacts between computers:
  – Email address books.
  – URL links.
• Network topology affects the rate and extent of spreading:
  – Epidemiological models, and the epidemic threshold.
• Controlling spread on scale-free networks:
  – Random vaccination is ineffective (e.g., anti-virus software).
  – Targeted vaccination of high-connectivity nodes.
  – Control degree distribution in time rather than space.

The Social Utility of Privacy
Robert Axelrod and Ryan Gerety
• Typical framing:
  – Privacy values should remain as is (e.g., Lessig).
  – Individual rights vs. state (i.e., civil liberties vs. community safety / crime).
• A community may have its own interest in defending individual privacy (and not), independent of the civil liberties argument:
  – To promote innovation in changing environments.
  – To cope with distortions (e.g., overconfidence of middle managers).
  – To compensate for overgeneralized norms.
• Not necessarily advocating more privacy:
  – From a societal/informational point of view, how should appropriate bounds on privacy be determined?
• Current status:
  – Exploratory modeling based on simple games.

Next Steps: Negative Representations
• Distributed negative representations
• Leaking partial information
• Relational algebra operators on the negative database:
  – Select, join, etc.
• Instance difficulty:
  – Hiding given satisfying assignments in a SAT formula
  – Approximate representations
  – Other representations?
• More realistic implementations
• Negative data mining:
  – Is it easier/harder to find certain instances in NDB?
• Imprecise representations:
  – Partial matching and queries
  – Learning algorithms

People
• Stephanie Forrest
• Paul Helman
• Fernando Esponda
• Elena Ackley

Publications
• F. Esponda, S. Forrest, and P. Helman. "Negative representations of information." International Journal of Information Security (submitted March 2005).
• F. Esponda, E. S. Ackley, S. Forrest, and P. Helman. "On-line negative databases." Journal of Unconventional Computing (in press).
• F. Esponda, S. Forrest, and P. Helman. "A formal framework for positive and negative detection." IEEE Transactions on Systems, Man, and Cybernetics 34:1 pp. 357-373 (2004).
• J. Balthrop, S. Forrest, M. Newman, and M. Williamson. "Technological networks and the spread of computer viruses." Science 304:527-529 (2004).
• H. Inoue and S. Forrest. "Inferring Java security policies through dynamic sandboxing." 2005 International Conference on Programming Languages and Compilers (PLC'05) (in press).
• F. Esponda, E. Ackley, S. Forrest, and P. Helman. "On-line negative databases." Third International Conference on Artificial Immune Systems (ICARIS), Best paper award (2004).

SUPPLEMENTARY MATERIAL

Probabilities
$$F_1 = P(x \in DB \mid x \notin NDB_{f_j}) = \frac{|DB|}{|U| - |NDB_{f_j}|}$$
$$F_2 = P(x \in DB \mid x \notin DB_{f_j}) = \frac{|DB| - |DB_{f_j}|}{|U| - |DB_{f_j}|}$$
$$H_N(x) = -F_1 \log_2 F_1 - (1 - F_1) \log_2 (1 - F_1)$$
$$H_P(x) = -F_2 \log_2 F_2 - (1 - F_2) \log_2 (1 - F_2)$$

Generating Hard-to-Reverse Negative Databases
• The randomized algorithm can be used to create a negative database.
• Insert/Delete operations turn known hard formulas into negative databases.
• The Morph operator may be used to search for hard instances.
[Figure: Instance Difficulty (l=64).]
[Figure axes (both panels): x-axis, specified bits per record (k-SAT), 1 to 8; y-axis, decisions (zchaff).]
[Figure: Instance Difficulty (Glassy8 formula, l=64). Series: Original NDB, Updated NDB.]
H. Jia, C. Moore and B. Selman, "From spin glasses to hard satisfiable formulas," SAT 2004.

Effect of the Morph operation
• The Morph operation takes as input a negative database NDB and outputs NDB' that represents the same set U-DB.
• The plot shows how the complexity of a database changes after applying the morph operator.
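
Worked sketch: the three-bit example
The three-bit DB / U-DB / NDB example from the Results slide can be checked by hand or with a few lines of code. The following is a minimal Python sketch, not the Perl/C prototype. The helper names (matches, in_db, reverse_ndb, party_intersection) are hypothetical, and NDB2 is a made-up second negative database added only to illustrate the set-intersection idea. It shows a linear-scan membership query, a brute-force reversal that enumerates all of U (feasible only for tiny string lengths), and the per-party intersection DBi - (NDB1 ∪ … ∪ NDBn).

```python
from itertools import product


def matches(pattern, s):
    """True if string s matches an NDB record; '*' is the don't-care symbol."""
    return len(pattern) == len(s) and all(p == "*" or p == c for p, c in zip(pattern, s))


def in_db(ndb, s):
    """Membership query by linear scan: s is in DB exactly when no NDB record matches it."""
    return not any(matches(p, s) for p in ndb)


def reverse_ndb(ndb, length, alphabet="01"):
    """Recover DB by testing every string in U. Exponential in length (illustration only)."""
    universe = ("".join(t) for t in product(alphabet, repeat=length))
    return {s for s in universe if in_db(ndb, s)}


def party_intersection(db_i, ndbs):
    """Party i keeps the strings of its own DB that no published NDB matches."""
    return {s for s in db_i if all(in_db(ndb, s) for ndb in ndbs)}


# NDB from the Results slide; it should represent U-DB for DB = {000, 101, 111}.
NDB = ["01*", "0*1", "1*0"]
recovered = reverse_ndb(NDB, 3)

# Hypothetical second party whose DB is {000, 111} (assumption for illustration).
NDB2 = ["0*1", "01*", "10*", "110"]
shared = party_intersection(recovered, [NDB, NDB2])

print(sorted(recovered))  # ['000', '101', '111']
print(sorted(shared))     # ['000', '111']
```

The brute-force reversal works here only because U has eight strings. For long records the enumeration is infeasible, which is consistent with the NP-hardness claim on the Results slide.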