
VERNIER
Virtualized Execution Realizing Network
Infrastructures Enhancing Reliability
Project Overview
July 2006
Background
• Commercial-off-the-shelf (COTS) software
– Large organizations, including DoD, have become dependent on it
– Yet, most COTS software is not dependable enough for critical
applications
• Security breaches
• Misconfiguration
• Bugs
• Large, homogeneous COTS deployments, such as those in DoD,
accentuate the risk, since many users
– Experience the same failures caused by the same vulnerabilities,
configuration errors, and bugs
– Suffer the same costly, adverse consequences
• Alternatives, such as government-funded development of high-assurance systems, present significant barriers in
– Cost
– Functionality
– Performance
VERNIER Project Overview
July 2006
2
VERNIER Project Objectives
• Develop new technologies to deliver the benefits of scaling
techniques to large application communities
– Provide enhanced survivability to the DoD computing infrastructure
– Enhance the cost, functionality, and performance advantages of
COTS computing environments
– Investigate and develop new technologies aimed at enabling
communities of systems running similar, widely available COTS
software to perform more robustly in the face of attacks and
software faults
• Deliver a demonstrated, functioning, transition-ready system
that implements these new application-community (AC) survivability
technologies
– Technical approach: Augmented virtual machine monitor
– Commercial transition partner: VMware, Inc.
Project Scope
• Collaborative detection and diagnosis of failures
• Collaborative response to failures
• Advanced situational awareness capabilities
– Collective understanding of community state
– Predictive capability: Early warning of potential future problems
• Key goal: turn the size and homogeneity of the user community
into an advantage by converting scattered deployments of
vulnerable COTS systems into cohesive, survivable application
communities that detect, diagnose, and recover from their own
failures
• What COTS?
– Microsoft Windows, IE, Office suite, and the like
Research Challenges
• Extracting behavioral models from binary programs
– Breakthrough novel techniques required
– Quasi-static state analysis for black-box binaries
• Scaled information sharing
– Networked application communities sharing knowledge about the
software they run
• Intelligent, comprehensive recovery
• Predictive situational awareness
– Automatic, easy-to-understand gauges
Breakthrough Capabilities
Expected Results and Impact
• COTS Product (VMware) with breakthrough capabilities for
application communities
• Scalability to 100K nodes running augmented VMware and
custom Vernier software
• Automatic collaborative failure diagnosis and recovery
• Survivable robust system
• Community-aware solution
VERNIER Team
• SRI International, Menlo Park, CA
– Patrick Lincoln, Principal Investigator
– Steve Dawson, Project manager; integration
– Linda Briesemeister, Knowledge sharing; collaborative response
– Hassen Saidi, Learning-based diagnosis; code analysis; situation awareness
• Stanford University
– John Mitchell, Stanford PI; code analysis; host-based detection and response
– Dan Boneh, Knowledge sharing protocols
– Mendel Rosenblum, VMM infrastructure; collaborative response; transition
liaison
– Alex Aiken, Quasi-static binary analysis
– Liz Stinson, Botswat; system security
• Palo Alto Research Center (PARC)
– Jim Thornton, PARC PI; configuration monitoring and response; situation
awareness
– Dirk Balfanz, Community response management
– Glenn Durfee, Configuration monitoring and response; situation awareness
• Technology transition partner: VMware, Inc.
VERNIER Technical Approach
Notional Host System Architecture
An Abstraction-Based Diagnosis
Capability for VERNIER
Objectives
Based on the general principle: "much of security amounts to making sure
that an application does what it is supposed to do … and nothing else!"
• Build models of application behaviors (what the application is supposed to do).
• Monitor application behavior and report malfunctions and unintended behaviors
(deviations from the model).
• Use the recorded execution traces as raw data for a set of abstraction-based diagnosis
engines (why did the deviation from good, intended behavior occur … to the extent
that we can do a good job answering such a question).
• Share the state of alerts and diagnoses among the nodes of the community (sharing the
bad news, but also the good!).
• Aggregate the diagnosis outputs and the alerts into a situation awareness gauge.
Approach
We combine a set of well-known and well-established techniques:
• Building increasingly accurate models of application behaviors:
– Static analysis combined with predicate abstraction to build Dyck and CFG models used for static
analysis-based intrusion detection
• Implement mechanisms for monitoring sequences of states and actions of an application
for the following purposes:
– Check if a known bad sequence is executed (signature-based!)
– Check for previously unknown variations of known bad sequences (correlation!)
– Find root causes for unexpected malfunctions and malicious exploits (diagnosis)
• Diagnosis is performed using techniques borrowed from:
– Delta-debugging (root-cause diagnosis)
– Anomaly detection (correlation)
• The situation awareness gauge is implemented as a platform-independent web interface
Monitoring-Based Diagnosis
• We combine these techniques into two phases:
– Monitoring: Applications are monitored, and sequences of executions along
with configurations are stored.
– Diagnosis: Differences between good runs and bad runs are the first clues
used for diagnosis.
• Traces of executions are sequences of:
– System calls
– Method calls
– Changes in configurations
• The more information is stored, the better the chance that malfunctions and
malicious behaviors are properly diagnosed.
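The delta-debugging idea behind the diagnosis phase can be sketched in a few lines. This is a simplified illustration of a `ddmin`-style minimization over recorded traces; the trace representation and the `is_failing` oracle are assumptions made for the example, not VERNIER's actual interfaces:

```python
# Simplified delta-debugging ("ddmin") over a recorded trace.
# `is_failing` is a hypothetical oracle that replays a candidate trace
# and reports whether the failure still occurs.

def ddmin(trace, is_failing, granularity=2):
    """Shrink a failing trace to a smaller subsequence that still fails."""
    while len(trace) >= 2:
        chunk = max(1, len(trace) // granularity)
        reduced = False
        for start in range(0, len(trace), chunk):
            candidate = trace[:start] + trace[start + chunk:]
            if candidate and is_failing(candidate):
                trace = candidate                      # smaller trace still fails
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(trace):              # cannot split any finer
                break
            granularity = min(len(trace), granularity * 2)
    return trace

# Toy oracle: the malfunction occurs whenever 'open' and 'exec' both appear.
failing = lambda t: 'open' in t and 'exec' in t
print(ddmin(['read', 'open', 'stat', 'exec', 'close'], failing))
# -> ['open', 'exec']: a minimal difference between good and bad runs
```

In the project setting, the oracle would replay a stored trace against the monitored application; here it is a stand-in predicate.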
Quasi-static binary analysis and predicate
abstraction-based intrusion detection
• Use static analysis to recover the control flow graph of the application.
• Build a pushdown system: a model that represents an over-approximation
of the sequences of method and system calls of the application.
– CFGs are generated by compilers for source code.
– Recover the class hierarchy for object code of OO applications.
– Deal with context sensitivity to match exit calls to return locations.
• Use predicate abstraction and data flow analysis to refine the pushdown
system and obtain a more accurate model.
– Improving the knowledge about arguments to monitored calls.
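As a rough illustration of what such a model buys at run time, the sketch below checks a stream of call/return/syscall events against a toy pushdown (Dyck-style) model: calls push a frame, returns must pop a matching frame, and each system call must be permitted in the current function. The model contents and event encoding are invented for the example; the real models are extracted from binaries by static analysis:

```python
# Toy pushdown monitor. Model and event encoding are illustrative only.

class DyckMonitor:
    def __init__(self, allowed_syscalls):
        self.allowed = allowed_syscalls   # function name -> permitted syscalls
        self.stack = ['main']             # current call context

    def step(self, event, arg=None):
        """Return True if the event conforms to the model."""
        if event == 'call':
            self.stack.append(arg)        # enter function `arg`
            return True
        if event == 'ret':
            if len(self.stack) > 1:
                self.stack.pop()          # matches the pending call
                return True
            return False                  # return with no matching call
        if event == 'syscall':
            return arg in self.allowed.get(self.stack[-1], set())
        return False

model = {'main': {'read'}, 'handler': {'open', 'write'}}
m = DyckMonitor(model)
trace = [('syscall', 'read'), ('call', 'handler'),
         ('syscall', 'open'), ('ret', None)]
print(all(m.step(e, a) for e, a in trace))   # True: trace fits the model
print(m.step('syscall', 'exec'))             # False: 'exec' not allowed in 'main'
```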
Better Models and Better Monitoring
We are not just interested in detecting intrusions, but also in
generating high-level explanations of why an
application deviates from its intended behavior.
• CFG and Dyck models are over-approximations of the application's behavior
(potential attacks are only discovered when the application's behavior deviates from
the model).
• We will use the runs of the application to generate under-approximations of the
application's behavior!
• Alternatively, every model representing an over-approximation has a dual that
represents an under-approximation (over- and under-approximations don't have to
be the same type of model!).
• We will combine over- and under-approximations to reduce the risk of missing
possible attacks.
• We will refine the over- and under-approximations to improve the application
model.
Combining over- and under-approximations
• Behavior outside the over-approximation (constructed by static analysis) is unsafe
• Behavior within the under-approximation (constructed from runs) is safe
• Behavior in between is suspicious and is the source of diagnosis
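The three-way verdict above can be phrased directly in code. In this sketch, sets of observed call sequences stand in for the real over- and under-approximating automata, which is a deliberate simplification:

```python
# Three-way classification from the over/under-approximation picture.
# Sets of call sequences stand in for the real automata (illustrative).

def classify(trace, over_approx, under_approx):
    if trace in under_approx:       # inside the under-approximation: seen in good runs
        return 'safe'
    if trace in over_approx:        # permitted statically but never observed
        return 'suspicious'         # -> handed to the diagnosis engines
    return 'unsafe'                 # outside the over-approximation

over = {('read', 'open'), ('read', 'write'), ('open', 'write')}   # static analysis
under = {('read', 'open')}                                        # from actual runs

print(classify(('read', 'open'), over, under))    # safe
print(classify(('open', 'write'), over, under))   # suspicious
print(classify(('exec',), over, under))           # unsafe
```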
What if we don’t have a model of the
application?
• We can monitor the application as a black box and
intercept system calls:
– Learn a model of good behaviors
– Learn a model of bad behaviors
• Anomalies are differences between good and bad
behaviors
• Borrow from delta-debugging techniques to find root
causes of misbehaviors
Configuration-based Detection,
Diagnosis, Recovery, and Situational
Awareness
Importance of Configuration
• Static configuration state highly correlated with system behavior
– Many attacks/bugs/errors introduced by way of a substantive change to
configuration
“A central problem in system administration is the construction of a secure
and scalable scheme for maintaining configuration integrity of a computer
system over the short term, while allowing configuration to evolve gradually
over the long term”
– Mark Burgess, author of cfengine
AC Opportunity
• Leverage the scale of the population to learn what the bad states in
configuration space are
• Today: every configuration change is an uncontrolled experiment
• AC future: configuration changes managed as controlled, reversible trials
[Chart: adaptability vs. reliability; the goal is to be high on both.]
Live Monitoring of Configuration State
1. State analysis
• Comparative diagnosis
• Vulnerability assessment
• Clustering similar nodes and contextualizing observations
2. Detect change events
• Cluster low-level changes into transactions
• Log events for problem detection, mitigation, and user interaction
• Share events in real time for situational awareness
3. Active learning
• Automated experiments to isolate root causes
• Managed testing of official changes like patch installation
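Clustering low-level changes into transactions can be illustrated with a simple temporal-proximity heuristic; the event format and the 2-second window are assumptions made for this sketch, not the project's actual design:

```python
# Group low-level registry/filesystem change events into "transactions"
# by temporal proximity. Event shape and window size are illustrative.

def cluster_changes(events, window=2.0):
    """Group (timestamp, description) events separated by gaps shorter
    than `window` seconds into one transaction each."""
    transactions, current = [], []
    for ts, desc in sorted(events):
        if current and ts - current[-1][0] >= window:
            transactions.append(current)     # gap: close current transaction
            current = []
        current.append((ts, desc))
    if current:
        transactions.append(current)
    return transactions

events = [(0.0, 'registry: Run key added'),
          (0.4, 'file: dropper.exe written'),
          (9.0, 'file: prefs.ini modified')]
print(len(cluster_changes(events)))   # 2: the first two events form one transaction
```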
Live Control of Configuration State
• Modification for Reversibility and Experimentation
– Coarse-grained: VM rollback
– Medium-grained: Installer/Uninstaller activation
– Fine-grained: Direct manipulation of low-level state elements
• Prevention
– In-progress detection of changes
– Interruption of change sequence
– Reversal of partial effects
Identifying Badness
• Objective Deterministic Criteria
– Rootkit detection from structural features
– Published attack signatures
• Objective Heuristic Criteria
– Performance outside of normal parameters
• Subjective End-User Report
– Dialog with user to gather info, e.g. temporal data for failure appearance
• Administrative Policy
– Rules specified by administrators within community
Local Components
[Diagram: each host runs, on a VMM (VM kernel): an App VM (COTS App OS with
applications and a VERNIER agent), an Experimental VM (a clone with its own
agent), and a VERNIER VM (console/UI, communication, and diagnosis components
over a VERNIER Monitor/Control layer on a VERNIER OS base), plus a link to the
community. Numbered interfaces: (1) VERNIER-Agent, (2) VERNIER-VMM,
(3) VERNIER-Community.]
Key Interfaces
1. VERNIER-Agent (TCP/IP, XML?)
– Registry change events
– Filesystem change events
– Install events
– Manipulate registry
– Manipulate filesystem
– Control System Restore
2. VERNIER-VMM (?)
– Suspend, Resume, Checkpoint, Revert, Clone, Reset
– Lock memory, Read memory, Read/write disk
– Process events
3. VERNIER-Community (?)
– Cluster management
– Experience reports: Unknown, Prevalent, Known Bad, Presumed Good
– State exchange
– Experiment request/response
Local Functions
[Diagram: a configuration-analysis detector consumes config change events
from the agent; a behavior-analysis detector consumes behavior events from
the VMM; a traffic-analysis detector consumes network events from a network
tap behind the firewall. All detectors feed an inside event stream into
analysis & diagnosis and a response controller; a communication manager
connects to the community and the console. A local DB holds local condition
detail, event logs, labeled condition signatures, state snapshots, and
experimental data.]
Adapting and Extending Host-based,
Run-time Win32 Bot Detection for
VERNIER
Exploit botnet characteristic: ongoing
command and control
• Network-based approaches:
– Filtering (protocol-, port-, host-, or content-based)
– Look for traffic patterns (e.g., DynDNS – Dagon)
– Hard (bots can encrypt traffic or permute it to look like 'normal'
traffic, …); bot writers control the arena
• Host-based approaches:
– Ours: we have more information at the host level.
Since the bot is controlled externally, use this meta-level behavioral
signature as the basis of detection
Our approach
• Look at the syscalls made by a program
– In particular, at certain of their args – our sinks
• Possible sources for these sinks:
– Local: { mouse, keyboard, file I/O, … }
– Remote: { network I/O }
• An instance of external control occurs when data from a
remote source reaches a sink
• This works surprisingly well: for all bots tested (ago, dsnx,
evil, g-sys, sd, spy), every command that exhibited external
control was detected
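The source-to-sink rule can be sketched as a tiny taint tracker: buffers derived from the network are marked, copies inherit the mark, and a marked argument reaching a sink is flagged as external control. All names and the identity-based bookkeeping are illustrative; the real system instruments Win32 calls inside the guest:

```python
# Toy taint tracker for the external-control heuristic (names illustrative).

tainted = []   # references to network-derived buffers (kept alive on purpose)

def recv_from_network(data):
    buf = bytearray(data, 'utf-8')      # remote source: mark the buffer
    tainted.append(buf)
    return buf

def copy_buffer(src):
    dst = bytearray(src)                # propagation: copies inherit taint
    if any(src is t for t in tainted):
        tainted.append(dst)
    return dst

def check_sink(syscall_name, arg):
    """Alarm (True) when a sink argument is derived from a remote source."""
    return any(arg is t for t in tainted)

cmd = recv_from_network('del C:\\*.*')                # bot command over the wire
print(check_sink('CreateProcess', copy_buffer(cmd)))  # True: external control
print(check_sink('CreateProcess', bytearray(b'notepad.exe')))  # False: local data
```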
Big picture
Design
Two modes
• Cause-and-effect semantics:
– Tight relationship between receipt of some data over network
and subsequent use of some portion of that data in a sink
• Correlative semantics: looser relationship
– Use of some data that is the same as some data received over
the network
– Why necessary?
Behaviors: ideally disjoint; detected at the lowest level in the call stack
Correlative semantics
• Why necessary?
– Bots with C library functions statically linked in are roughly equivalent
to unconstrained out-of-band (OOB) copies
• In general, almost as good as cause-and-effect semantics
(static vs. dynamic linking)
– Exceptions: commands that format received params (e.g., via
sprintf)
Benign program testing
• Tested against some benign programs that interact with the
network
– Firefox, mIRC, Unreal IRCd
• 3 contextual false positives
– IRCd: sent on X heard on Y
– Firefox: dereferencing embedded links
• Artificial false positives: quite a few
– mIRC: DCC capabilities
– Firefox: saving contents to a file, …
False positives
a) Contextual false positives – not present in bots
– External control heuristic correctly detected, but these actions under
these circumstances are widely accepted as non-malicious
b) Artificial false positives – not present in bots
– Definition of external control implies no user input agreeing to the
particular behavior
– But we don't track "explicitly clean" data (that received via keyboard
or mouse)
c) Spurious false positives
– Any other incorrect flagging of external control
Our mechanism — review
• Single behavioral meta-signature detects wide variety of
behaviors on majority of Win32 bots
– Resilient to differences in implementation
• Resilient in face of unconstrained OOB copies
• Resilient to encryption – w/some constraints
• Resilient to changes in command-and-control protocol (e.g.
from IRC to HTTP) and parameters (e.g. for rendezvous
point)
Knowledge Sharing in VERNIER
Knowledge Sharing
• Need: Communication is the core concept of a community
– Application communities rely on the ability to share knowledge that is
reliable, efficient, authentic, and secure
• Approach: two-tier peer-to-peer platform
– Tuple space (ala Linda)
– Considering JavaSpaces implementation of tuple spaces
– Two-tier for better scalability
• If needed, hypercube hashtable index (ala Obreiter and Graf)
• Benefits: Reliable, efficient (local) knowledge sharing
• Competition: Other possible methods for knowledge sharing
include explicit messaging, centralized database, and statically
indexed knowledge structures.
– Other approaches lack scalability, are unreliable, and can be
difficult to secure
Knowledge Sharing Levels
• Lower level (within a cluster)
– Tuple space (ala Linda (Gelernter))
– Simple queries
• (*, name, *) returns records regarding ‘name’
– Concurrent access and update
• Higher level (supernodes)
– Nodes aggregate knowledge of an entire cluster
– Use abstraction to summarize current situation
– Application-level multicast to push out summaries
– Supernode pushes all summary updates into local tuple space
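The lower-level queries can be illustrated with a minimal Linda-style matcher, where `'*'` is a wildcard. This is a toy stand-in for the JavaSpaces-style implementation under consideration; the record fields are invented:

```python
# Minimal Linda-style tuple matching with '*' as a wildcard (illustrative).

WILDCARD = '*'

def matches(template, record):
    return len(template) == len(record) and all(
        t == WILDCARD or t == r for t, r in zip(template, record))

def read_all(space, template):
    """Return every record in the space that matches the template."""
    return [rec for rec in space if matches(template, rec)]

space = [('alert', 'worm-X', 'node17'),
         ('status', 'worm-X', 'node04'),
         ('alert', 'misconfig', 'node17')]

# (*, name, *) returns records regarding 'name', as on the slide:
print(read_all(space, (WILDCARD, 'worm-X', WILDCARD)))
# -> [('alert', 'worm-X', 'node17'), ('status', 'worm-X', 'node04')]
```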
Group Communication
• Group communication is key
– For higher level, certain usual assumptions
• Reliable delivery
• Ordered message delivery
• Spread (www.spread.org) as a basis for implementation of
group communication
– Building on Secure Spread and Progress Software's (progress.com)
more secure, reliable, and scalable variants of Spread
Group Communication Security and Privacy:
Secrecy and Authenticity
• Security and privacy are critical aspects of VERNIER
• Must authenticate reports and ensure correctness
• Confidentiality of reports
– Protecting user privacy (my files, my keystrokes)
– Protect aspects of applications
– Protect configuration information
– Protect vulnerability detection information
• Community members send status reports to local supernode
• Reports propagated throughout network
Group Communication Security
• Defense against:
– Network attacks sending forged messages to supernodes
+ PKI
– Compromised community member sending false reports
+ Statistical anomaly detection (e.g., EMERALD)
+ Virtualization: any report generated within a compromised virtual
machine must be consistent with what is observed outside the
virtualization layer
Group Communication Security
• Secure audit logs
– Secure log of all P2P status reports
– Enable post-mortem analysis on detected attacks
– Cryptographic protection of log (Boneh, Waters)
• Sanitizing status reports
– Status reports reveal private information
– Special encryption enabling read only by credentialed members
and search (as in search over an encrypted database) by the community
• Mitigating denial of service attacks on supernodes
– Re-election of supernodes when under attack
• Securing configuration update messages
– PKI authenticating legitimate reports from community members
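The secure-audit-log idea can be sketched as a hash chain over status reports, which makes truncation or in-place edits detectable. This is a simplification of the cited cryptographic scheme, which additionally supports search over encrypted reports:

```python
# Hash-chained audit log over P2P status reports: each entry's digest
# covers the previous digest, so tampering breaks the chain.

import hashlib

def append(log, report):
    prev = log[-1][1] if log else '0' * 64       # genesis value
    digest = hashlib.sha256((prev + report).encode()).hexdigest()
    log.append((report, digest))

def verify(log):
    prev = '0' * 64
    for report, digest in log:
        if hashlib.sha256((prev + report).encode()).hexdigest() != digest:
            return False                         # chain broken: tampering
        prev = digest
    return True

log = []
append(log, 'node17: alert worm-X')
append(log, 'node04: status ok')
print(verify(log))                               # True: log intact
log[0] = ('node17: all ok', log[0][1])           # tamper with first report
print(verify(log))                               # False: detected
```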
Schedule, Experimentation, and
Evaluation
Schedule and Milestones
[Gantt chart spanning Q1–Q10, CY'06–CY'08, divided into Phase 1 and Phase 2:]
• Infrastructure: VMM recovery/rollback; VMM enforcement mechanisms
• Diagnosis: config management; quasi-static analysis; learning-based
diagnosis; sharing protocols
• Response: community response management; application integration
• Awareness: situation awareness gauge
• System integration: unit testing & integration; scalability dev/testing
• Delivery & transition: software packaging; tech transition planning
• Testing & evaluation: system testing & metrics; red teaming support
• Management: status reports; integration milestones; PI meetings & demos;
red teaming; software & doc delivery
Experimentation and Evaluation
• Project testbed
– Network of 300 virtual hosts
• 30 server-class physical hosts
• 10 virtual nodes per server
– Three clusters, one at each participant site
• Software
– Host OS: Linux
– Guest (community) OS: Microsoft Windows
– Applications: IE browser (possibly others); MS Office
• Simulations and scalability
– Financially infeasible to scale to thousands of nodes
– Plan is to use hybrid simulation to test scalability
• Real (live) nodes provide actual data
• Simulated nodes use synthesized data generated by perturbing data
collected from real clusters’ supernodes
Proposed Success Criteria
• Metrics and targets (team-defined)
– False positives (FP) / False negatives (FN)
• Phase 1: FP < 10%, FN < 20%
• Phase 2: FP < 1%, FN < 2% (order of magnitude improvement)
– Percent loss of network availability
• Phase 1: At most 20% per node, with at most 80% over any 500ms interval
• Phase 2: At most 5% per node, with at most 20% over any 500ms interval
– Average time to recovery
• Phase 1: Assuming a fix exists (not a FN), at most 30 minutes to recover the entire
community
• Phase 2: At most 10 minutes
– Average network and computational overhead
• No more than 30% slowdown for applications
• No more than 100 KB/s average VERNIER-induced network traffic per node
– Percent accuracy of prediction
• Phase 1: Effects of problems predicted within 15 minutes of onset; set of nodes
wrongly predicted (either way) differs by no more than 40% of actual
• Phase 2: Prediction within 5 minutes; predicted set differs by no more than 20%