Coordinated Statistical Modeling and
Optimization for Ensuring Data Integrity
and Attack-Resiliency in Networked Embedded Systems
Farinaz Koushanfar, ECE Dept.
Rice University Statistics Colloquium
Oct 9, 2006
Outline
• Sensor Networks: Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack
detection
Sensor Networks
• Comprehensive monitoring and analysis of
complex physical environments
• Imagine…
– Air pollution (http://www.ucsusa.org/clean_energy/coalvswind/c02c.html)
– Flood in Houston (http://www.bluishorange.com/flood/photos/10bridgecars.jpg)
– Vibration in Abercrombie (http://dacnet.rice.edu/maps/space/index.cfm?building=abc)
– Texas wine!! (http://www.alamosawinecellars.com/vineyard2.htm)
Sensor Networks, How?
• Networks of embedded sensing (actuating)
and computing devices
Mica2Dot,
CrossBow Tech.
Courtesy of Prof. Estrin, CENS, UCLA
Challenges in Sensor Networks
• System: sensors, actuators, hardware, software,
communication network layers,
• Limited: battery, bandwidth, cost
• Unique to sensor networks: Sensing
– Abstract the system state, complex properties, and
model physical phenomena accurately, without biases
• Parametric models: a priori assumptions
• Often do not capture the complex relationships
– Optimization based on such models has limited
effectiveness
Challenges in Sensing
• Massive datasets
– Structure response in USGS building: 72 channels of 24 bit data, 500
samples/sec.
• Energy consumption of the wireless nodes
– Motes take 36 mW in active mode; AA batteries with a storage
capacity of 1850 mWh give ~50 h in active mode
• Diversity in applications
– Marine biology, seismic sensing, battlefield, contaminant transport,
home sensors, laboratories, hospitals, etc.
• Harsh environmental conditions
– Battlefield, earthquakes, automatic detection, etc.
• Wireless channel data loss
• Sensor cost
• Sensitivity of applications
• Privacy and security
Inconsistencies in the Measured
Sensor Data
• Erroneous measurements
– Noisy readings: inevitable due to power and cost constraints and
environmental impact
– Systematic errors: offset bias, calibration effects, etc.
– Partially corrupted, still useful
• Faulty (corrupted) measurements
– Remove faults to get a consistent picture
– Can be accidental (e.g. bad link), or malicious
• Missing data
– May be accidental, intentional (sleeping, subsampling,
compression, filtering), or malicious
Outline
• Sensor Networks: Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack
detection
Motivational Example
• Deployments show a gap b/w models and the reality
• Example: preliminary analysis of temperature sensor
traces at UCLA BG
• 23 sensor nodes, sampling every 5 mins
• Question: does the locality assumption hold?
[Figure: map of the 23 temperature sensor node locations in the
UCLA BG deployment]
Motivational Example (Cont’d)
• No consistent relation b/w sensing
and distance
• Discontinuities, exposure
differences, global sources
• Also, some highly correlated close-by sensors
• Best previous effort: local basis
functions
• Need new models for simultaneous
abstraction of sensing and distance
• What about other properties?
Motivational Example (Cont’d)
• E.g., sensing, distance
– dij: distance b/w si, sj
– eij: sensing prediction error, for the model sj = fij(si)
• The distance and sensing are not jammed into one model, but are
being simultaneously considered
– Define multiple graphs G1, G2, …, GM, that share vertices
• Separation of concerns
• Embedded sensing models
[Tables: the distance graph’s adjacency matrix (pairwise distances
dij) and the sensing graph’s adjacency matrix (pairwise prediction
errors eij) over nodes 1–23]
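The two coordinated graphs above can be sketched in a few lines of code. This is a minimal illustration, not the talk’s implementation: the predictor fij below is a hypothetical offset-corrected model, and all function and variable names are invented for the example.

```python
import numpy as np

def build_graphs(positions, readings):
    """Two graphs over the same vertices (sketch): a distance graph and a
    sensing graph whose edge weights are pairwise prediction errors under
    a naive offset predictor sj = fij(si) := si + (mean(sj) - mean(si))."""
    P = np.asarray(positions, dtype=float)   # (N, 2) node coordinates
    R = np.asarray(readings, dtype=float)    # (N, T) sensor time series
    diff = P[:, None, :] - P[None, :, :]
    D = np.hypot(diff[..., 0], diff[..., 1])  # distance-graph adjacency
    offset = R.mean(axis=1)                   # per-node mean level
    N = len(P)
    E = np.zeros_like(D)                      # sensing-graph adjacency
    for i in range(N):
        for j in range(N):
            pred = R[i] + (offset[j] - offset[i])   # predict sj from si
            E[i, j] = np.mean(np.abs(pred - R[j]))  # prediction error eij
    return D, E
```

Note that D and E share the same vertex set but generally disagree: two nodes can be close in D yet far in E, which is exactly the discrepancy the coordinated models exploit.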
Motivational Example-2
• Cross-domain optimization: Sensor deployment
– Objective: select up to S candidate points for adding
an extra sensor
– For each si, a TL sensor is a Delaunay neighbor that
cannot be predicted within the error bound
– Denote the edges of TL sensors as candidates
– Find intelligent ways to select the best set of
candidate points
Motivational Example (Cont’d)
• Coordinated modeling-optimization
– Q1: How to do cross-domain optimization?
– Q2: Can the models be of higher dimensions?
– Q3: Can they help us to address data-integrity
problem?
– Q4: How effective are they?
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack
detection
Inter-sensor Models
• Intra-sensor models (autoregressive models)
• Have shown the effectiveness of adding shape
constraints to univariate models
– Isotonicity
– Unimodality
– Number of level sets
– Convexity
– Bijection
– Transitivity
• Combinatorial isotonic regression (CIR) finds the
optimal nonparametric shape-constrained univariate fit
for an arbitrary error norm in average linear time
• Models are a precursor for the subsequent optimization
Application of CIR on Temperature
Sensors at Intel Berkeley*
• Prediction error over all node pairs
• Limiting the number of level sets
* Koushanfar, Taft (Intel), Potkonjak, Infocom’06
Multivariate CIR
• Recent result*:
– The first optimal, polynomial-time DP-based approach
for multi-dimensional CIR:
(1) Build the relative importance matrix R
(2) Build the error matrix E
(3) Build a cumulative error matrix C by using a
nested DP
(4) Starting at the minimum value in the last column of
C, trace back a path to the first column that minimizes
the cumulative error
• Thanks to Prof. D. Brillinger (UCB) and Prof. M. Potkonjak (UCLA) for
the useful discussions
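The DP steps above can be sketched for the univariate building block (one explanatory variable, one nested level). This is a hedged illustration, not the authors’ code: the L1 error norm is a simplifying assumption, and the O(L·A) inner loop does not reproduce the average-linear-time result, only the build-E / build-C / trace-back structure.

```python
import numpy as np

def cir_fit(x, y, alphabet):
    """Univariate combinatorial isotonic regression (sketch).
    x: integer predictor level per sample (0..L-1, in increasing order)
    y: observed responses
    alphabet: sorted candidate fitted values (the finite alphabet A)
    Returns a non-decreasing fitted value per predictor level,
    minimizing the total L1 error."""
    alphabet = np.asarray(alphabet, dtype=float)
    L, A = max(x) + 1, len(alphabet)
    # (2) error matrix: E[l, a] = sum over samples at level l of |y - alphabet[a]|
    E = np.zeros((L, A))
    for xi, yi in zip(x, y):
        E[xi] += np.abs(yi - alphabet)
    # (3) cumulative error C[l, a] = E[l, a] + min_{a' <= a} C[l-1, a'],
    #     which enforces the isotonic (non-decreasing) constraint
    C = E.copy()
    back = np.zeros((L, A), dtype=int)
    for l in range(1, L):
        best = 0
        for a in range(A):
            if C[l - 1, a] < C[l - 1, best]:
                best = a
            back[l, a] = best          # argmin of C[l-1, :a+1]
            C[l, a] += C[l - 1, best]
    # (4) trace back from the minimum entry of the last level
    levels = np.empty(L, dtype=int)
    levels[-1] = int(np.argmin(C[-1]))
    for l in range(L - 1, 0, -1):
        levels[l - 1] = back[l, levels[l]]
    return alphabet[levels].tolist()
```

The multivariate version nests one such DP per explanatory variable, which is where the A^(M+1) blow-up in the complexity slide comes from.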
Example
(2) Input: a 4×4×4 error matrix E (slices shown for z0–z3), A=4
(3) The steps of the nested DP on E, with a 3D view of the DP over
axes x(0)–x(3), y(0)–y(3), z(0)–z(3); cumulative error tables are
built level by level for y0–y3
(4) Final bivariate regression: a 4×4 table over x0–x3, y0–y3 whose
entries are the selected levels (z1 or z2)
[Figure: the numeric slices of E, the intermediate cumulative-error
tables, and the final level-set assignment]
Multivariate CIR - complexity
• T sensor values drawn from a finite alphabet A
• Complexity of the univariate case is dominated by
sorting: O(T log T)
• Cm(M): complexity of the multivariate case with M
explanatory variables
• Cm(M) = A^(M+1) · Cm(M-1): pseudo-polynomial
complexity
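A toy unrolling of the recurrence makes the blow-up concrete. This assumes the reading Cm(M) = A^(M+1) · Cm(M-1) with Cm(0) a given univariate base cost; the function name is hypothetical.

```python
def cir_complexity(M, A, base):
    """Unroll Cm(M) = A**(M+1) * Cm(M-1) down to the univariate base
    cost Cm(0) = base, multiplying in one factor per nested DP level."""
    c = base
    for m in range(1, M + 1):
        c *= A ** (m + 1)
    return c
```

Even for a small alphabet (A=4), two explanatory variables already cost 1024× the univariate base, which is why the open-questions slide asks about pruning.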
Open Questions
• How to speed up the Multivariate CIR?
– Pruning algorithms that exploit sparsity (?)
• Is it possible to make CIR locally adaptive?
– In principle, finding the min error is a global optimization that
cannot be locally addressed
• Can one guarantee convergence and correctness of CIR
among sensors?
• Is it possible to have continuous approximations to
address the problem?
• How can one build efficient models in the presence of missing
and/or faulty data?
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods
State-of-the-Art Sensing Models
• Parametric models
– Gaussian random fields, graphical models (GM),
message passing, iterative message passing, belief
propagation (BP)
• Nonparametric models
– Marginalized kernels (GM), alternating projections,
distributed EM, nonparametric BP
• Common thread: capture dependence among
sensor data; no edge means no dependence
• Need to capture the shape of field discontinuities
and/or lack of correlations b/w adjacent nodes
Embedded Sensing Models
• Principle of separation of concerns (SoC)
• Example:
Geometric graph (planar-2D)
Delaunay edges (adjacency)
Sensing graph: higher
dimensional embedded graph
[Figure: a planar geometric graph with Delaunay (adjacency) edges
over nodes 1–8, next to the corresponding higher-dimensional
sensing graph over the same nodes]
Idea: Map the sensing graph into lower dimensions.
Exploit the discrepancy between the higher dimensional topology and
the lower dimensional space to identify the obstacles
Open Questions
• Efficient computation and handling of embedded
sensing models in higher dimensions
• Joint compression of multiple entities
• How can we capture dynamic topologies, i.e.
mobility, dynamic time series, sleeping
• Efficient structures/data formats for representing
the multi-dimensional topologies
Coordinated Modeling and
Optimization
• Paramount importance of interface in system and
software development
• Create statistical models suitable for optimization
– Paradigms: continuous, smooth, consistent
– Small number of level sets
– Convexity
– Bijection: x'i = G(F(xi)) = xi, where yi = F(xi) and xi = G(yi)
– Transitivity: zi = F(xi) = G(yi)
• Create optimization mechanisms resilient to statistical
variability
– Paradigms: randomization
– Multiple validations
– Constructive probabilistic
– Reweighting of OF and constraints
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods
Data Integrity: Multiple Validations
• The data-integrity problems are hard due to complex
environments and uncertainties
– Proof of NP-completeness (PhD’05)
• Data integrity (noise reduction, calibration, fault
detection, data recovery) exploits system redundancies
• Coordinated modeling-optimization
• Multiple validations (MV) optimization algorithms
– The solutions are validated using multiple input samples
– Similar in spirit to cross-validation (CV) in statistics
– MV is more comprehensive than CV, since it is a generic
optimization paradigm based on resampling the input space and
validating the output of a complex algorithm rather than a model
Example: Missing Data
• Between 40%-50% missing data
at the Intel Berkeley testbed
• Limited A2D: discrete level sets
[Figures: temperature (C) and humidity (%) vs. time, with present
and missing data marked]
Missing Data Recovery (MSD)
Problem formulation:
• Given:
– N sensors s1, …, sN,
– Sensor’s data at time t: (d1(t), d2(t), …, dN(t)),
– Some sensor data missing in an arbitrary way, i.e.
there is an i such that di(t) = NA
• Objective: recover the missing data in such a
way that the consistency between the readings
of different sensors is maximized (prediction
error is minimized)
[Figure: example network of nine sensor nodes]
State-of-the-Art in MSD
• MSD is a prevalent problem in many fields
• Expectation maximization (EM) Dempster et al.’77
– Assuming multivariate density
– Local optimization, likely to be trapped in the local
max of the likelihood function
• Multiple imputations (MI) Rubin 1987
– Missing data replaced by multiple simulated versions
– May distort variable associations due to treating the
completed dataset as the actual one
• Both MI/EM can be computationally intensive
• MV often combines lower dimensional models
MV for Missing Data Recovery
• Iteratively select a sub-sample of available nodes (the
present set) and optimize for it
• Remaining nodes (holdout set) used for validating the
solution and quantifying its uncertainty
– 1) Randomly assign k: {1,…,|V|} → {1,…,K};
– 2) for (k=1 to k=K)
• a: calculate OF-k(O);
• b: compute MVC-k(O);
– 3) MVC(O) = G(MVC-k(O)), k=1, …, K;
– 4) Obest = argminO MVC(O);
• Advantage: not only a solution, but an uncertainty bound
for the solutions
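The MV loop for missing-data recovery can be sketched as follows. This is a speculative illustration, not the talk’s algorithm: mean absolute holdout error stands in for the MVC, the G-aggregation is a plain average over folds, and both imputation candidates are invented for the example.

```python
import numpy as np

def mv_recover(data, candidates, K=5, seed=0):
    """Multiple-validations sketch. data: (T, N) array with np.nan for
    missing entries; candidates: imputation functions f(data) -> completed
    array. Scores each candidate by masking random folds of *observed*
    entries and measuring how well it predicts them."""
    rng = np.random.default_rng(seed)
    obs = np.argwhere(~np.isnan(data))        # indices of observed entries
    folds = rng.permutation(len(obs)) % K     # random fold assignment (step 1)
    scores = np.zeros(len(candidates))
    for k in range(K):                        # step 2
        holdout = obs[folds == k]
        masked = data.copy()
        masked[holdout[:, 0], holdout[:, 1]] = np.nan  # hide the holdout set
        for c, impute in enumerate(candidates):
            completed = impute(masked)
            err = (completed[holdout[:, 0], holdout[:, 1]]
                   - data[holdout[:, 0], holdout[:, 1]])
            scores[c] += np.mean(np.abs(err))          # MVC-k, L1 criterion
    best = candidates[int(np.argmin(scores))]          # steps 3-4
    return best(data), scores / K

def col_mean(d):
    # candidate 1: impute each missing entry with its sensor's own mean
    out = d.copy()
    means = np.nanmean(out, axis=0)
    r, c = np.where(np.isnan(out))
    out[r, c] = means[c]
    return out

def global_mean(d):
    # candidate 2: impute with the grand mean over all sensors
    out = d.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out
```

Besides picking the winner, the per-fold scores give exactly the uncertainty bound the slide advertises: the spread of MVC-k values over folds.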
Open Questions
• Theoretical proof of correctness of MV: which of the
properties of CV hold for MV?
• Which MV criteria (MVC) are robust to outliers (e.g.,
order statistics)?
• Which objective function (OF) to use?
– Ensemble-voting of weak classifiers by boosting (exponential
loss function)
• Real-time implementation on sensor networks testbeds
• Scaling properties of the MV algorithm
Outline
• Sensor Networks, Applications, Challenges
• Coordinated Modeling-Optimization Framework
– Inter-sensor models
– Embedded sensing models
– Optimization for data integrity
• Attack Resilient Location Discovery
– Problem formulation and attack models
– Robust random sample consensus for attack detection
– Evaluation and comparison to competing methods
Location Discovery*
• A number of nodes have
location data (beacons)
• Other nodes estimate their
distance to beacons to find their
locations
• Many distance estimation
methods (e.g., AoA, ToA)
• With more than three beacons, a
node can estimate its location
• We focus on the atomic case
(one unknown)
* Joint work with N. Kiyavash, UIUC
[Figure: node s0 with unknown location, surrounded by beacons
s1–s10]
Robust Location Discovery:
Problem Formulation
• Instance:
– A node s0 with unknown coordinates (x0,y0),
– Set L of location tuples {(xn,yn,dn)} (n beacons),
– A consistency metric between sn and s0, and a
consistency threshold t
• Problem:
– Find an estimate for (x0,y0) that is consistent, under
the metric, with at least t points in the set L
Attack Model
• The attackers can modify the distance measurement of
any beacon without any limits
• The network is cryptographically protected against
protocol attacks, e.g., wormhole, sybil
• The measurements from each beacon are only
considered once
• Both independent and coalition (colluding) attacks
• In coalition attacks, the attacking beacons coordinate
their efforts
• There is a minimum number of correct beacons,
otherwise colluding beacons will mislead the target
Robust Random Sample Consensus
1. Initialize i;
2. While (i < imax)
a. Randomly draw a subset Si of size 3 from L;
b. Use Si to estimate s^0;
c. Calculate K, the number of consistent points w.r.t.
s^0 in L\Si;
d. If (K > t)
{form a new s^0 from the K points; Terminate;}
e. Increment i;
3. Terminate and output the largest consistent
estimate;
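Steps 1-3 above can be sketched in code. This is a hedged illustration with assumed details: distances are Euclidean, the consistency metric is the absolute range residual with a hypothetical threshold eps, and trilateration is done by the standard linearized least squares rather than any particular method from the talk.

```python
import numpy as np

def trilaterate(beacons):
    """Least-squares (x0, y0) from (x, y, d) tuples, by subtracting the
    first beacon's circle equation to linearize (x0-x)^2+(y0-y)^2=d^2."""
    b = np.asarray(beacons, dtype=float)
    x, y, d = b[:, 0], b[:, 1], b[:, 2]
    A = 2 * np.column_stack([x[1:] - x[0], y[1:] - y[0]])
    rhs = (x[1:] ** 2 - x[0] ** 2 + y[1:] ** 2 - y[0] ** 2
           + d[0] ** 2 - d[1:] ** 2)
    pos, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pos  # array([x0, y0])

def ransac_locate(L, t, imax=200, eps=0.5, seed=0):
    """Robust random sample consensus for the atomic case (sketch)."""
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    best, best_K = None, -1
    for _ in range(imax):
        idx = rng.choice(len(L), size=3, replace=False)     # step 2a
        est = trilaterate(L[idx])                           # step 2b
        rest = np.delete(L, idx, axis=0)
        resid = np.abs(np.hypot(rest[:, 0] - est[0],
                                rest[:, 1] - est[1]) - rest[:, 2])
        consistent = rest[resid < eps]                      # step 2c
        if len(consistent) > t:                             # step 2d
            return trilaterate(np.vstack([L[idx], consistent]))
        if len(consistent) > best_K:
            best_K, best = len(consistent), est
    return best                                             # step 3
```

Attacked beacons with arbitrarily corrupted distances simply fail the consistency check, so they are excluded from the final refit as long as enough correct beacons remain.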
Selecting the parameters: imax, t
• q - prob. of correctness of a randomly drawn point
• Expected number of trials: E[i] = 1/q^3
• γ - threshold for the prob. of missing a good subset:
(1-q^3)^imax = γ, or imax = ln(γ) / ln(1-q^3)
• I - set of inliers; ε - percentage of inliers, ε = 1 - Na/N
• q^3 = Π_{j=0}^{2} (|I|-j) / (N-j)
• For large datasets, q^3 ≈ ε^3, so E[i] ≈ ε^(-3)
• The number of iterations is imax = ln(γ) / ln(1-ε^3)
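Plugging numbers into the standard RANSAC stopping rule imax = ln(γ)/ln(1-ε^3) (assuming, as in the slide, a minimal sample of 3 points that must all be inliers) shows how cheap the guarantee is; the function name is hypothetical.

```python
from math import ceil, log

def ransac_iterations(eps, gamma):
    """Smallest imax such that the probability of never drawing an
    all-inlier triple in imax independent trials is below gamma,
    with eps the inlier fraction: (1 - eps^3)^imax <= gamma."""
    return ceil(log(gamma) / log(1 - eps ** 3))
```

For example, with half the beacons attacked (ε = 0.5) and a 1% miss probability, only 35 iterations are needed; with 90% inliers, 4 suffice.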
Evaluation - Random Sample Consensus

Na/N   |s^0-s0|   FN %
10%    0.06       0
20%    0.07       1
30%    0.07       2.5
40%    0.11       3.5
50%    0.13       3.7
Comparison to Other Algorithms
[Figures: accuracy vs. competing algorithms under independent
attackers and under colluding attackers]
Summary
• Sensor networks: importance of sensing, data integrity –
missing data, faults, noise, systematic errors
• Coordinated modeling and optimization framework
– Nonparametric models, shape constraints
– Multivariate CIR: optimal algorithm, slow in multiple dimensions
– Embedded sensing models, separation of concerns
– Projection into lower dimensions
– Optimization algorithm: multiple validations (MV)
• Attack-resilient location discovery
– 25% more effective in the presence of coalition attackers, 35+%
more effective against independent attackers