Coordinated Statistical Modeling and
Optimization for Ensuring Data Integrity
and Attack-Resiliency in Networked Embedded Systems
Farinaz Koushanfar, ECE Dept.
Rice University Statistics Colloquium
Oct 9, 2006
Outline
• Sensor Networks: Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack
detection
Sensor Networks
• Comprehensive monitoring and analysis of
complex physical environments
• Imagine… Air pollution
Flood in Houston
Vibration in Abercrombie
Texas wine!!
http://www.ucsusa.org/clean_energy/coalvswind/c02c.html
http://www.bluishorange.com/flood/photos/10bridgecars.jpg
http://dacnet.rice.edu/maps/space/index.cfm?building=abc
http://www.alamosawinecellars.com/vineyard2.htm
Sensor Networks, How?
• Networks of embedded sensing (actuating)
and computing devices
Mica2Dot, Crossbow Tech.
Courtesy of Prof. Estrin, CENS, UCLA
Challenges in Sensor Networks
• System: sensors, actuators, hardware, software,
communication network layers,
• Limited: battery, bandwidth, cost
• Unique to sensor networks: Sensing
– Abstract the system state, complex properties, and
model physical phenomena accurately, without biases
• Parametric models: a priori assumptions
• Often do not capture the complex relationships
– Optimization based on such models has limited
effectiveness
Challenges in Sensing
• Massive datasets
– Structure response in USGS building: 72 channels of 24 bit data, 500
samples/sec.
• Energy consumption of the wireless nodes
– Motes take 36 mW in active mode → AA batteries with a storage capacity of
1850 mWh → about 50 h in active mode
• Diversity in applications
– Marine biology, seismic sensing, battlefield, contaminant transport,
home sensors, laboratories, hospitals, etc.
• Harsh environmental conditions
– Battlefield, earthquakes, automatic detection, etc.
• Wireless channel data loss
• Sensor cost
• Sensitivity of applications
• Privacy and security
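The dataset size and battery lifetime quoted in the bullets above follow from simple arithmetic. A minimal Python sketch of that check; the constant and function names are illustrative and not from the talk:

```python
# Figures quoted on the slide.
CHANNELS = 72            # channels in the USGS building example
BITS_PER_SAMPLE = 24
SAMPLES_PER_SEC = 500
ACTIVE_POWER_MW = 36.0   # mote power draw in active mode
BATTERY_MWH = 1850.0     # AA battery storage capacity


def data_rate_bytes_per_sec(channels, bits, rate_hz):
    """Raw data rate in bytes per second, with no compression or framing."""
    return channels * bits * rate_hz / 8


def lifetime_hours(capacity_mwh, power_mw):
    """Hours of continuous active mode on one battery charge."""
    return capacity_mwh / power_mw


rate = data_rate_bytes_per_sec(CHANNELS, BITS_PER_SAMPLE, SAMPLES_PER_SEC)
hours = lifetime_hours(BATTERY_MWH, ACTIVE_POWER_MW)
print(f"{rate:.0f} bytes/s, {hours:.1f} h of active mode")
```

The lifetime comes out slightly above the rounded 50 h figure on the slide.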
Inconsistencies in the Measured
Sensor Data
• Erroneous measurements
– Noisy readings: inevitable due to power and cost constraints and
environmental impact
– Systematic errors: offset bias, calibration effect, etc
– Partially corrupted, still useful
• Faulty (corrupted) measurements
– Remove faults to get a consistent picture
– Can be accidental (e.g. bad link), or malicious
• Missing data
– May be accidental, intentional (sleeping, subsampling,
compression, filtering), or malicious
Outline
• Sensor Networks: Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack
detection
Motivational Example
• Deployments show a gap b/w models and the reality
• Example: preliminary analysis of temperature sensor
traces at UCLA BG
• 23 sensor nodes, sampling each 5 mins
• Question: does the locality assumption hold?
[Figure: map of the 23 sensor nodes, labeled by node number]
Motivational Example (Cont’d)
• No consistent relation b/w sensing
and distance
• Discontinuities, exposure
differences, global sources
• Also, some highly correlated nearby sensors
• Best previous effort: local basis
functions
• Need new models for simultaneous
abstraction of sensing and distance
• What about other properties?
[Figure: map of the 23 sensor nodes, labeled by node number]
Motivational Example (Cont’d)
• Separation of concerns
• Embedded sensing models:
– Define multiple graphs G1, G2, …, GM, that
share vertices
• E.g., sensing, distance
– dij: distance b/w si,sj
– eij: sensing prediction error, for the model
sj=fij(si)
• The distance and sensing are not
jammed into one model, but are being
simultaneously considered

Sensing graph: adjacency matrix
       1     2     3     …    22    23
 1     0    .1    .4   ....  .41   .34
 2    .1     0   .32   ....  .45   .42
 …   ....  ....  ....  ....  ....  ....
22    .4   .41   .28   ....    0   .11
23   .27   .31   .26   ....  .09     0

Distance graph: adjacency matrix
       1     2     3     …    22    23
 1     0    11   105   ....   64    66
 2    11     0    94   ....   57    65
 …   ....  ....  ....  ....  ....  ....
22    64    57    99   ....    0    32
23    66    65   130   ....   32     0
Motivational Example-2
• Cross-domain optimization: Sensor deployment
– Objective: select up to S candidate points for adding
an extra sensor
– For each si, a TL sensor is a Delaunay neighbor that
cannot be predicted within the error bound
– Denote the edges of TL sensors as candidates
– Find intelligent ways to select the best set of
candidate points
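One simple way to read the candidate step above is to flag Delaunay-neighbor pairs whose sensing prediction error exceeds the bound, then keep the S worst. The sketch below is only an illustration under that assumption: the function name, the greedy worst-first ranking, and the choice of which pairs count as Delaunay neighbors are hypothetical, and the error values are borrowed from the sensing adjacency matrix shown earlier.

```python
def candidate_edges(neighbor_edges, pred_error, threshold, max_candidates):
    """Return up to max_candidates neighbor edges that cannot be predicted
    within the error bound, ordered from worst to best.

    neighbor_edges : iterable of (i, j) pairs assumed to be Delaunay neighbors
    pred_error     : dict mapping (i, j) to the error of predicting s_j from s_i
    threshold      : error bound above which an edge becomes a candidate
    """
    flagged = [(pred_error[e], e) for e in neighbor_edges
               if pred_error[e] > threshold]
    flagged.sort(reverse=True)  # largest prediction error first
    return [edge for _, edge in flagged[:max_candidates]]


# Hypothetical neighbor pairs with illustrative prediction errors.
edges = [(1, 2), (1, 3), (2, 3), (22, 23)]
errors = {(1, 2): 0.10, (1, 3): 0.40, (2, 3): 0.32, (22, 23): 0.11}
picked = candidate_edges(edges, errors, threshold=0.3, max_candidates=2)
print(picked)
```

Ranking by error alone ignores geometry; the slide asks for more intelligent selection, so this is a baseline rather than the proposed method.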
Motivational Example (Cont’d)
• Coordinated modeling-optimization
– Q1: How to do cross-domain optimization?
– Q2: Can the models be of higher dimensions?
– Q3: Can they help us to address data-integrity
problem?
– Q4: How effective are they?
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack
detection
Inter-sensor Models
• Intra-sensor models (autoregressive models)
• Have shown the effectiveness of adding shape
constraints to univariate models
– Isotonicity
– Unimodality
– Number of level sets
– Convexity
– Bijection
– Transitivity
• Combinatorial isotonic regression (CIR), finds the
optimal nonparametric shape constrained univariate fit
for an arbitrary error norm in average linear time
• Models are a precursor to subsequent optimization
Application of CIR on Temperature
Sensors at Intel Berkeley*
• Prediction error over all node pairs
• Limiting the number of level sets
* Koushanfar, Taft (Intel), Potkonjak
Infocom’06
Multivariate CIR
• Recent result*:
– The first optimal, polynomial-time DP-based approach
for multi-dimensional CIR:
(1) Build the relative importance matrix R
(2) Build the error matrix E
(3) Build a cumulative error matrix C by using a
nested DP
(4) Starting at the minimum value in the last column of
C, trace back a path to the first column that minimizes
the cumulative error
•Thanks to Prof. D. Brillinger (UCB), Prof. M. Potkonjak (UCLA) for the
useful discussions
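Steps (2) to (4) above are easiest to see in the univariate case. The following Python sketch fits a non-decreasing function over finite alphabets with a dynamic program: it builds an error matrix E, accumulates it into C, and traces back the minimum-error path. It is an illustration only, not the authors' implementation; it assumes an L1 error norm, omits the relative importance matrix R, and does not attempt the nested multi-dimensional DP.

```python
def isotonic_fit(x_levels, y_levels, samples):
    """Non-decreasing fit y = f(x) over finite alphabets, minimizing L1 error.

    x_levels, y_levels : allowed values of x and y (sorted internally)
    samples            : list of (x, y) pairs with x in x_levels
    Returns (fit, total_error), where fit[k] is the value at x_levels[k].
    """
    x_levels, y_levels = sorted(x_levels), sorted(y_levels)
    nx, ny = len(x_levels), len(y_levels)
    col = {v: k for k, v in enumerate(x_levels)}

    # Error matrix: E[a][k] = cost of predicting y_levels[a] at x_levels[k].
    E = [[0.0] * nx for _ in range(ny)]
    for x, y in samples:
        for a, level in enumerate(y_levels):
            E[a][col[x]] += abs(y - level)

    # Cumulative error matrix: C[a][k] = E[a][k] + min over a' <= a of C[a'][k-1].
    C = [[0.0] * nx for _ in range(ny)]
    back = [[0] * nx for _ in range(ny)]
    for a in range(ny):
        C[a][0] = E[a][0]
    for k in range(1, nx):
        best_a = 0
        for a in range(ny):
            if C[a][k - 1] < C[best_a][k - 1]:
                best_a = a  # running argmin over levels <= a
            C[a][k] = E[a][k] + C[best_a][k - 1]
            back[a][k] = best_a

    # Trace back from the minimum of the last column to the first column.
    a = min(range(ny), key=lambda r: C[r][nx - 1])
    total = C[a][nx - 1]
    fit = [None] * nx
    for k in range(nx - 1, -1, -1):
        fit[k] = y_levels[a]
        a = back[a][k]
    return fit, total


fit, total = isotonic_fit([0, 1, 2, 3], [1, 2, 3, 4],
                          [(0, 1), (1, 3), (2, 2), (3, 4)])
print(fit, total)
```

The running argmin keeps each column update linear in the alphabet size, so the whole table costs on the order of A times the number of x levels.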
Example
(2) Input: 4*4*4 error matrix E, A=4
[Tables: the four slices z0, z1, z2, z3 of E, each with rows y0 to y3 and columns x0 to x3]
(3) 3D view of DP on E
[Figure: DP on E along the X, Y and Z axes]
(3) The steps of nested DP on E
[Tables: cumulative error for y0 through y3, each with rows z0 to z3 and columns x0 to x3]
(4) Final bivariate regression
      x0   x1   x2   x3
y0    z1   z1   z1   z1
y1    z1   z1   z1   z1
y2    z2   z2   z2   z2
y3    z2   z2   z2   z2
Multivariate CIR - complexity
• T sensor values drawn from a finite alphabet A
• Complexity of univariate case is dominated by
sorting, O(T log T)
• Cm(M): complexity of multivariate with M
explanatory variables
• $C_m(M) = A^{M+1} C_m(M-1)$, pseudo-polynomial
complexity
Open Questions
• How to speed up the Multivariate CIR?
– Pruning algorithms that exploit sparsity (?)
• Is it possible to make CIR locally adaptive?
– In principle, finding the min error is a global optimization that
cannot be locally addressed
• Can one guarantee convergence and correctness of CIR
among sensors?
• Is it possible to have continuous approximations to
address the problem?
• How can one build efficient models in presence of missing
and/or faulty data?
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods
State-of-the-Art Sensing Models
• Parametric models
– Gaussian random fields, graphical models (GM),
message passing, iterative message passing, belief
propagation (BP)
• Nonparametric models
– Marginalized kernels (GM), alternating projections,
distributed EM, nonparametric BP
• Common thread: capture dependence among
sensor data, no edge means no dependence,
• Need to capture the shape of field discontinuities
and/or lack of correlations b/w adjacent nodes
Embedded Sensing Models
• Principle of separation of concerns (SoC)
• Example:
Geometric graph (planar-2D): Delaunay edges (adjacency)
Sensing graph: higher dimensional embedded graph
[Figure: both graphs drawn over the same nodes, numbered 1 to 8]
Idea: Map the sensing graph into lower dimensions.
Exploit the discrepancy between the higher dimensional topology and
the lower dimensional space to identify the obstacles
Open Questions
• Efficient computation and handling of embedded
sensing models in higher dimensions
• Joint compression of multiple entities
• How can we capture dynamic topologies, i.e.
mobility, dynamic time series, sleeping
• Efficient structures/data formats for representing
the multi-dimensional topologies
Coordinated Modeling and
Optimization
• Paramount importance of interface in system and
software development
• Create statistical models suitable for optimization
– Paradigms: continuous, smooth, consistent
– Small number of level sets
– Convexity
– Bijection x'i= G(F(xi)) = xi, where yi=F(xi) and xi=G(yi)
– Transitivity zi = F(xi) = G(yi)
• Create optimization mechanisms resilient to statistical
variability
– Paradigms: randomization
– Multiple validations
– Constructive probabilistic
– Reweighting of OF and constraints
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods
Data Integrity: Multiple Validations
• The data-integrity problems are complex due to the
complex environments and uncertainties
– Proof of NP-completeness (PhD’05)
• Data integrity (noise reduction, calibration, fault
detection, data recovery) exploits system redundancies
• Coordinated modeling-optimization
• Multiple validations (MV) optimization algorithms
– The solutions are validated using multiple input samples
– Similar in spirit to cross-validation (CV) in statistics
– MV is more comprehensive than CV, since it is a generic
optimization paradigm based on resampling the input space and
validating the output of a complex algorithm rather than a model
Example: Missing Data
• Between 40%-50% missing data
at Intel Berkeley testbed
• Limited A2D: discrete level sets
[Figure: temperature (C) vs. time, present and missing data]
[Figure: humidity (%) vs. time, present and missing data]
Missing Data Recovery (MSD)
Problem formulation:
• Given:
– N sensors s1, …,sN,
– Sensor’s data at time t: (d1(t), d2(t),…,dN(t))
– Some sensor data missing in an arbitrary way, i.e.
there is i, such that di(t)=NA
• Objective: recover the missing data in such a
way that the consistency between the readings
of different sensors is maximized (prediction
error is minimized)
State-of-the-Art in MSD
• MSD is a prevalent problem in many fields
• Expectation maximization (EM) Dempster et al.’77
– Assuming multivariate density
– Local optimization, likely to be trapped in the local
max of the likelihood function
• Multiple imputations (MI) Rubin 1987
– Missing data replaced by multiple simulated versions
– May distort variable association due to treating the
completed dataset as the actual one
• Both MI/EM can be computationally intensive
• MV often combines lower dimensional models
MV for Missing Data Recovery
• Iteratively select a sub-sample of available nodes (the
present set) and optimize for it
• Remaining nodes (holdout set) used for validating the
solution, quantify its uncertainty
– 1) Randomly assign κ: {1,…,|V|} → {1,…,K};
– 2) for (k=1 to k=K)
• a: calculate OF^{-k}(O);
• b: compute MVC^{-k}(O);
– 3) MVC(O) = G(MVC^{-k}(O)), k=1, …, K;
– 4) Obest = argmin_O MVC(O);
• Advantage: not only a solution, but an uncertainty bound
for the solutions
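The four steps above can be sketched in Python as follows. Everything concrete here is an assumption made for illustration: the candidate solutions are two simple estimators (mean and median of the present readings), the per-fold criterion is the mean absolute error on the holdout nodes, and the aggregation G is the median over folds. The talk leaves all three choices open.

```python
import random
import statistics


def multiple_validations(readings, candidates, K, seed=0):
    """Pick the candidate with the smallest aggregated holdout criterion.

    readings   : dict mapping node id to its sensor reading
    candidates : dict mapping a name to fit(present_values) -> estimate
    K          : number of groups the nodes are split into
    """
    rng = random.Random(seed)
    nodes = sorted(readings)
    rng.shuffle(nodes)
    # Step 1: random assignment of nodes to groups 0..K-1 (balanced).
    group = {v: pos % K for pos, v in enumerate(nodes)}

    mvc = {}
    for name, fit in candidates.items():
        fold_scores = []
        for k in range(K):  # Step 2
            present = [readings[v] for v in nodes if group[v] != k]
            holdout = [readings[v] for v in nodes if group[v] == k]
            if not present or not holdout:
                continue
            estimate = fit(present)  # 2a: optimize on the present set
            # 2b: validate on the holdout set
            fold_scores.append(
                statistics.mean(abs(estimate - r) for r in holdout))
        mvc[name] = statistics.median(fold_scores)  # Step 3: G = median
    best = min(mvc, key=mvc.get)  # Step 4: argmin over candidates
    return best, mvc


# Illustrative readings: eight consistent sensors and one faulty one.
readings = {1: 19.0, 2: 19.1, 3: 18.9, 4: 19.0, 5: 19.2,
            6: 19.1, 7: 18.8, 8: 19.0, 9: 45.0}
candidates = {"mean": statistics.mean, "median": statistics.median}
best, mvc = multiple_validations(readings, candidates, K=3)
print(best, mvc)
```

The spread of the per-fold scores, not just their median, is what would give the uncertainty bound mentioned in the last bullet.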
Open Questions
• Theoretical proof of correctness of MV, which of the
properties of CV holds for MV?
• Which MV criteria (MVC) are robust to outliers: e.g.,
order statistics
• Which objective function (OF) to use?
– Ensemble-voting of weak classifiers by boosting (exponential
loss function)
• Real-time implementation on sensor networks testbeds
• Scaling properties of the MV algorithm
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods
Location Discovery*
• A number of nodes have
location data (beacons)
• Other nodes estimate their
distance to beacons to find their
locations
• Many distance estimation
methods (e.g., AoA, ToA)
• With three or more beacons, a
node can estimate its location
• We focus on the atomic case
(one unknown)
* Joint work with N. Kiyavash, UIUC
[Figure: example network of nodes s0 through s10]
Robust Location Discovery:
Problem Formulation
• Instance:
– A node s0 with unknown coordinates (x0,y0),
– Set L of location tuples {(xn,yn,dn)} (n beacons),
– Consistency metric δ(sn,s0), consistency
threshold t
• Problem:
– Find an estimate for (x0,y0) s.t. it is at least
δ(sn,s0)-consistent with t points in set L
Attack Model
• The attackers can modify the distance measurement of
any beacon without any limits
• The network is cryptographically protected against
protocol attacks, e.g., wormhole, Sybil
• The measurements from each beacon are only
considered once
• Both independent and coalition (colluding) attacks
• In coalition attacks, the attacking beacons coordinate
their efforts
• There is a minimum number of correct beacons,
otherwise colluding beacons will mislead the target
Robust Random Sample
Consensus
1. Initialize i;
2. While (i < imax)
   a. Randomly draw a subset Si of size 3 from L;
   b. Use Si to estimate ŝ0;
   c. Calculate K, the number of δ-consistent points w.r.t.
      ŝ0 in L\Si;
   d. If (K > t)
      {form a new ŝ0 from the K points; Terminate;}
   e. Increment i;
3. Terminate and output the largest consistent
estimate;
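The loop above can be sketched in pure Python. The names `solve_position`, `residual` and `robust_locate` are hypothetical, and two choices are assumptions rather than details from the talk: the position estimate is a linearized least-squares trilateration, and a point counts as consistent when its reported distance differs from the estimated one by at most `tol`.

```python
import math
import random


def solve_position(beacons):
    """Least-squares (x, y) from three or more tuples (xn, yn, dn).

    Subtracting the first beacon's circle equation from the others gives
    linear equations a*x + b*y = c, solved through 2x2 normal equations.
    Returns None when the beacons are (nearly) collinear.
    """
    x1, y1, d1 = beacons[0]
    saa = sab = sbb = sac = sbc = 0.0
    for x, y, d in beacons[1:]:
        a, b = 2.0 * (x - x1), 2.0 * (y - y1)
        c = d1**2 - d**2 + x**2 - x1**2 + y**2 - y1**2
        saa += a * a
        sab += a * b
        sbb += b * b
        sac += a * c
        sbc += b * c
    det = saa * sbb - sab * sab
    if abs(det) < 1e-9:
        return None
    return ((sac * sbb - sab * sbc) / det, (saa * sbc - sab * sac) / det)


def residual(est, beacon):
    """Gap between the reported distance and the distance implied by est."""
    x, y, d = beacon
    return abs(math.hypot(est[0] - x, est[1] - y) - d)


def robust_locate(L, tol, t, i_max, seed=0):
    """Random sample consensus over subsets of three beacons."""
    rng = random.Random(seed)
    best, best_k = None, -1
    for _ in range(i_max):
        subset = rng.sample(L, 3)                      # step 2a
        est = solve_position(subset)                   # step 2b
        if est is None:
            continue
        consistent = [b for b in L
                      if b not in subset and residual(est, b) <= tol]
        k = len(consistent)                            # step 2c
        if k > best_k:
            best, best_k = est, k
        if k > t:                                      # step 2d
            refit = solve_position(consistent)
            return (refit if refit is not None else est), k
    return best, best_k                                # step 3


# Illustrative data: unknown node at (3, 4), seven honest beacons, and two
# attackers that inflate their reported distances.
true_pos = (3.0, 4.0)
honest = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 0), (0, 5), (7, 9)]
L = [(x, y, math.hypot(x - true_pos[0], y - true_pos[1])) for x, y in honest]
L.append((10, 5, math.hypot(7, 1) + 5.0))
L.append((2, 8, math.hypot(1, 4) + 3.0))
estimate, k = robust_locate(L, tol=0.01, t=3, i_max=50)
print(estimate, k)
```

With noisy distances the tolerance would have to reflect the measurement error, and the refit would average that noise over the consistent set.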
Selecting the parameters: imax, t
• q - prob. of correctness of a randomly drawn point
• Expected number of trials, $E[i] = 1/q^3$
• $\epsilon$ - threshold for the prob. of missing a good subset: $(1-q^3)^{i_{max}} = \epsilon$, or $i_{max} = \ln(\epsilon) / \ln(1-q^3)$
• I – set of inliers; $\phi$ - percentage of inliers
• $\phi = 1 - N_a/N$
• $q = \binom{|I|}{3} / \binom{N}{3} = \prod_{j=0}^{2} \frac{|I|-j}{N-j}$
• For large datasets, $q = \phi^3$ and $E[i] = \phi^{-9}$
• The number of iterations is $i_{max} = \ln(\epsilon) / \ln(1-\phi^9)$
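The iteration bound can be evaluated numerically. This Python sketch follows the first three bullets, taking q as the probability that a single randomly drawn point is correct; the function names and the rounding up to a whole number of iterations are assumptions.

```python
import math


def expected_trials(q):
    """Expected number of draws of 3 points until all three are correct."""
    return 1.0 / q**3


def max_iterations(q, eps):
    """Smallest whole imax with (1 - q^3)^imax <= eps."""
    return math.ceil(math.log(eps) / math.log(1.0 - q**3))


# Illustrative values: 70% of points correct, 1% chance of missing a good subset.
print(expected_trials(0.7), max_iterations(0.7, 0.01))
```

The bound grows quickly as q drops, which is why the attack model assumes a minimum number of correct beacons.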
Evaluation – Random Sample Consensus
Na/N    |ŝ0-s0|    FN %
10%     0.06       0
20%     0.07       1
30%     0.07       2.5
40%     0.11       3.5
50%     0.13       3.7
Comparison to Other Algorithms
[Plots: independent attackers; colluding attackers]
Summary
• Sensor networks: importance of sensing, data integrity –
missing data, faults, noise, systematic errors
• Coordinated modeling and optimization framework
– Nonparametric models, shape constraints
– Multivariate CIR, optimal algorithm, slow in multiple dimensions
– Embedded sensing models, separation of concerns
– Projection into lower dimensions
– Optimization algorithm: multiple validations (MV)
• Attack-resilient location discovery
– 25% more effective in presence of coalition attackers, 35+%
more effective on independent attackers