GAMPS-sigmod09 - Microsoft Research

GAMPS
COMPRESSING MULTI SENSOR DATA BY
GROUPING & AMPLITUDE SCALING
Sorabh Gandhi, UC Santa Barbara
Suman Nath, Microsoft Research
Subhash Suri, UC Santa Barbara
Jie Liu, Microsoft Research
Fine Grained Sensing & Data Glut


Advances in sensing technology ⇒ fine-grained, ubiquitous
sensing of the environment
Many applications, but the issue is data glut
Automated Data Center Cooling: [MSFT DCGenome project]

Physical parameters, e.g. humidity, temperature

1000s of sensors, 10 bytes/sensor/sec ⇒ 10s of GBs/day
Server Performance Monitoring: [MSFT server farm monitoring]

Performance counters, e.g. CPU utilization, memory usage

100s of counters, 1000s of servers, few bytes/counter/sec ⇒ TBs/day
Focus and Objectives

Data archival + (reliable and fast) query processing




Obvious solution: compression; the data is a set of time series




Centralized setting
Point query: report value for sensor x, time t
Similarity query: report sensors ‘similar’ to sensor x in time range
Initial idea: approximate every time series individually
Many approximation techniques known, e.g. DFT, DCT, piecewise linear
Focus: L∞ error [guarantee on point queries]
Example techniques: wavelets, piecewise constant/linear approximations
Compression not enough!!

Gives up to an order of magnitude improvement; we want more
Signals are Correlated!
[Figure: # connected users vs. time, annotated with: dynamic groups, shifted/scaled groups, similar signals in a group]

Server dataset: 40 signals, 1 day, sampling once
every 30 seconds, counter: # of connected users
Contributions

We propose GAMPS, which exploits linear correlations
among multiple signals while compressing them
together, and gives L∞ guarantees

Compression both along time and across signals
We propose an index structure for compressed data
that gives fast responses to many relevant queries
Through simulations on real data, we show that on large
datasets, GAMPS can achieve up to an order of
magnitude improvement over state-of-the-art
compression techniques
State of the art: Single Signal
Optimal L∞ approximations

Problem: Given a time series S and input parameter ε,
approximate S with piecewise constant segments such
that the L∞ error is <= ε
Greedy algorithm (PCGreedy(S, ε))
Lazaridis et al., ICDE'03
[Figure: original time series and its approximation, with a band of width 2ε marked]
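The greedy idea behind PCGreedy(S, ε) can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a segment is extended while the spread (max minus min) of its points stays within 2ε, and stores the midpoint as the segment value. The names `pc_greedy` and `reconstruct` and the sample data are hypothetical.

```python
def pc_greedy(series, eps):
    """Greedy piecewise-constant approximation with max (L-infinity) error <= eps.

    Returns a list of (start_index, end_index, value) segments, end inclusive.
    Assumption: a segment is closed as soon as adding the next point would
    make (max - min) exceed 2 * eps; its value is the midpoint of min and max.
    """
    segments = []
    if not series:
        return segments
    start = 0
    lo = hi = series[0]
    for i in range(1, len(series)):
        new_lo = min(lo, series[i])
        new_hi = max(hi, series[i])
        if new_hi - new_lo > 2 * eps:
            # The current point does not fit: close the running segment.
            segments.append((start, i - 1, (lo + hi) / 2))
            start, lo, hi = i, series[i], series[i]
        else:
            lo, hi = new_lo, new_hi
    segments.append((start, len(series) - 1, (lo + hi) / 2))
    return segments


def reconstruct(segments):
    """Expand segments back into one value per original sample."""
    out = []
    for start, end, value in segments:
        out.extend([value] * (end - start + 1))
    return out


# Made-up sample data, only for illustration.
series = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9, 1.0]
segments = pc_greedy(series, eps=0.2)
approx = reconstruct(segments)
```

Every reconstructed value is within ε of the original sample, which is what makes the point-query guarantee hold.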
GAMPS Overview

GAMPS takes as input the set of time series and
approximation parameter ε
[Diagram: Data → COMPRESSION (Partition phase → Grouping phase → Amplitude scaling phase) → Compressed data → INDEXING (Index structure)]
Compression



Partition phase: partitions the data into contiguous time intervals
Group phase: divides a given partition into groups of similar signals
Amplitude scaling phase: compression happens with sharing of
representations
Compression by Amplitude Scaling



Given a group of k ‘similar’ signals
Let the signals be denoted by set X = {X1, X2, …, Xk}
Key idea: express every signal Xi as a scaled version of
some signal Xj: Xi = AiXj





Ai is the ratio/amplitude signal and Xj is the base signal
If signal Xi is a perfectly scaled version of Xj then Ai = constant
To reconstruct Xi, we only need to store the constant and Xj
In reality, there is no perfect correlation
However, we found that if there are enough linearly
correlated signals, smartly approximating the Ai's and Xj
can give very good compression factors!
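A small sketch of why a ratio signal can be cheap to store. It assumes the ratio is taken pointwise (Ai = Xi / Xj) and counts piecewise-constant segments with a simple greedy pass. The helper names, the sample data, and the error values are hypothetical, and the sketch does not show how the error budget is split between base and ratio signals.

```python
def ratio_signal(x_i, x_j):
    """Pointwise ratio A_i = X_i / X_j (assumes the base signal has no zeros)."""
    return [a / b for a, b in zip(x_i, x_j)]


def count_segments(series, eps):
    """Number of constant segments a greedy pass needs so that every point
    lies within eps of its segment value (spread of a segment <= 2 * eps)."""
    if not series:
        return 0
    count = 1
    lo = hi = series[0]
    for v in series[1:]:
        new_lo, new_hi = min(lo, v), max(hi, v)
        if new_hi - new_lo > 2 * eps:
            count += 1          # start a new segment at v
            lo = hi = v
        else:
            lo, hi = new_lo, new_hi
    return count


# Made-up data: `scaled` is a perfectly scaled copy of `base`.
base = [10.0, 12.0, 15.0, 11.0, 9.0, 14.0]
scaled = [2 * v for v in base]
ratio = ratio_signal(scaled, base)

segments_direct = count_segments(scaled, eps=0.5)   # approximate X_i directly
segments_ratio = count_segments(ratio, eps=0.01)    # approximate A_i instead
```

Here the ratio is constant, so one segment suffices for it, while approximating the scaled signal directly needs a new segment at almost every sample.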
Illustration: Amplitude Scaling on Real Dataset

DataCenter dataset

Input: X = {X1, X2, …, X6}, ε = 1%
Need to choose a base signal and divide ε between the base
signal (εb) and the ratio signal approximations (εr)

6 signals shown for ~3 days each, parameter: relative humidity
Oracle: X4 is the base signal; the oracle also provides the values εb and εr
Run PCGreedy(X4, εb) and PCGreedy(Ai, εr) for signals
other than the base signal
Illustration: Amplitude Scaling on Real Dataset
[Figure, three panels: individual approx (y-axis: relative humidity), base signal approx (y-axis: relative humidity), ratio signal approx (y-axis: ratio)]

Leftmost figure: all signals use PCGreedy() with ε = 1.0%
Middle figure: higher-fidelity base signal, εb = 0.4%
Rightmost figure: ratio signals

Very sparse (small number of segments to represent)
Quantitative Comparison for Amplitude Scaling

Compression factor = M1/M2


M1 = number of segments in individual signal approximations
M2 = number of segments in (base signal + ratio signal) approximations
Comparison with optimal individual approximations

For this illustrative dataset, compression factor
(1% error) is 1.9
Grouping and Amplitude Scaling by Facility Location

Facility location problem





Problem is modeled as a graph G(V, E)
Opening a facility at node j costs c(j)
Serving a demand point i using facility j
costs w(i, j)
Objective is to choose F ⊆ V that minimizes
Σ_{j ∈ F} c(j) + Σ_{i ∈ V} min_{j ∈ F} w(i, j)
[Figure: graph]
Grouping & amplitude scaling is modeled as facility
location



Complete graph, every signal is represented by a node
Cost of opening a facility: # segments needed to represent the base signal
Cost of serving a demand point: # segments needed to represent the
ratio signal
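To make the facility-location view concrete, here is a brute-force sketch for a tiny instance. It enumerates every non-empty set of base signals and charges each signal the cheapest ratio cost to an open base. The function name, the cost numbers, and the convention that a base signal serves itself at cost 0 are assumptions for illustration; the talk uses an integer linear program, not enumeration.

```python
from itertools import combinations


def solve_facility_location(open_cost, serve_cost):
    """Exhaustive search over facility sets (exponential, toy sizes only).

    open_cost[j]     : cost of opening a facility (base signal) at node j
    serve_cost[i][j] : cost of serving node i from facility j (ratio signal)
    Returns (best_total_cost, tuple_of_open_facilities).
    """
    n = len(open_cost)
    best_cost, best_facilities = float("inf"), None
    for size in range(1, n + 1):
        for facilities in combinations(range(n), size):
            cost = sum(open_cost[j] for j in facilities)
            # Each node is served by its cheapest open facility.
            cost += sum(min(serve_cost[i][j] for j in facilities)
                        for i in range(n))
            if cost < best_cost:
                best_cost, best_facilities = cost, facilities
    return best_cost, best_facilities


# Hypothetical segment counts: signals 0 and 1 are similar, as are 2 and 3.
open_cost = [10, 8, 9, 7]
serve_cost = [
    [0, 2, 9, 9],
    [2, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
best_cost, bases = solve_facility_location(open_cost, serve_cost)
```

With these numbers the search opens signals 1 and 3 as bases, giving two groups, for a total of 18 segments.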
Implementation Setup

We set εb = 0.4ε [error allocation for base signal]

Facility location: NP-hard




We show results with the exact solution (integer linear program)
Approximate solutions are within 90% of the results shown
Time taken to solve the linear program is at most a few seconds
We use three different datasets



Server dataset: 240 signals, 1 day data [CPU utilization counter]
DataCenter dataset: 24 signals, 3 days of data [humidity sensors]
IBT dataset: 45 signals, 1 day of data [temperature sensors in a
building in Berkeley]
Quantitative Evaluation: GAMPS

Figure on the left: compression factor over raw data
For 1.5% error: 300 for the Server dataset, 50 for the other two

Figure on the right: compression factor over individual
approximations
For 1.5% error: between a factor of 2 and 10

Compression factor is highest for the Server dataset
Average group size is highest (60, compared to 4.5 and 6)
Scaling versus Group size


We extracted 60 signals in the same group for the
Server dataset
Compression factor (versus individual approximations)
increases as group size increases
Advantage of Grouping



Demonstrate the advantage of having multiple groups
Datasets IBT and Server
Hybrid: algorithm which allows only 1 group


Every signal is either in the group or approximated individually
For both datasets and at all error levels, grouping gives a
large advantage

Compression factor: 1.5 (IBT) to 9 (Server) [error 1.5%]
Grouping: Geographical Locality

IBT dataset, 1 day, error = 1.5%

GAMPS runs the grouping on the entire day's data

Picture on the left shows the sensor layout in the Intel Berkeley lab
Hexagons are sensor positions, crosses are sensors without data for
the one day, rectangles are outliers (individual approximations)
Simple region boundaries conform to our intuition
[Figure panels: Sensor Layout, Group Layout]
Indexing Compressed Data
[Figure: skip-list of groups (1 to 5); ptr. to base signal; skip-list of approx. lines for ratio signal]
We propose a skip-list-based index structure

Point query: log(n)
Range query: log(n) + range
Similarity query: log(n) + # groups in range
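A point query over compressed data can be sketched as two logarithmic lookups, one in the ratio signal and one in the base signal, followed by a multiplication. In this sketch a sorted array with binary search stands in for the skip list: it gives the same log(n) lookup, but it is a substitute and not the structure proposed in the talk. The class name, function name, and segment values are hypothetical.

```python
from bisect import bisect_right


class SegmentIndex:
    """Piecewise-constant segments sorted by start time, searched in O(log n)."""

    def __init__(self, segments):
        # segments: list of (start_time, value) pairs, sorted by start_time;
        # each segment is valid until the next segment starts.
        self.starts = [start for start, _ in segments]
        self.values = [value for _, value in segments]

    def value_at(self, t):
        pos = bisect_right(self.starts, t) - 1
        if pos < 0:
            raise KeyError(f"time {t} precedes the first segment")
        return self.values[pos]


def point_query(base_index, ratio_index, t):
    """Reconstruct X_i(t) = A_i(t) * X_j(t) from the two approximations."""
    return ratio_index.value_at(t) * base_index.value_at(t)


# Made-up approximations of a base signal and one ratio signal.
base = SegmentIndex([(0, 40.0), (10, 42.0), (25, 41.0)])
ratio = SegmentIndex([(0, 1.5), (20, 1.25)])
value = point_query(base, ratio, 12)   # 1.5 * 42.0
```

A range query would locate the first segment the same way and then walk forward through consecutive segments, which matches the log(n) + range cost listed above.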
Future Work


How to distribute error between base and ratio
signals?
How about generic linear transformations?
We use only the ratio signal (scaling): Xi = AiXj
Maybe we can get much better compression by using Xi = AiXj + Bi
How about piecewise linear signals?
The underlying algorithm is not so trivial (convex hulls)
Can we apply this technique to 2D signals?
Consider a video: every pixel value over time ⇒ a time series
Every pixel time series is correlated with neighboring pixel time series
Thanks for your attention
Example Query: Similarity Query


Based on grouping, we can define a similarity
coefficient for a given time range (t1, t2)
Indicator = 1 if signals Si and Sj are in the same
group at time t
[Figure: similarity query on part of the IBT dataset]
Compression by Interval Sharing

Key idea: if two sensors have nearly overlapping time
series, they can share a part of the approximation
[Figure: Signal 1 and Signal 2; the representation can be shared]
Let the number of signals be k and the desired error be ε
(α, β) approximation algorithm

For a given error ε, say the optimal algorithm uses OPT segments
An (α, β) algorithm has error no more than αε and uses no more than
β·OPT segments
We propose a polynomial-time (5, log k + log OPT)
approximation algorithm for approximation with PC
segments using interval sharing
Multiple Correlated Signals: Example 1
Server Dataset

Instant messaging service – Server dataset



240 servers, 2 weeks, >= 100 performance counters
40 signals shown (normalized) for one day, counter: #connected users,
sampling rate once in 30 seconds
Signals are correlated (almost overlapping) with each
other; can we exploit this in compression?
Multiple Correlated Signals: Example 2
DataCenter Dataset

Data center monitoring
24 sensors, 2 years, 2 parameters: humidity, temperature
6 signals shown for ~3 days each, parameter: relative humidity,
sampling rate once every 30 seconds

Signals not overlapping, but still correlated
Shifting or scaling may help

Question: can we exploit this correlation?
We propose a technique to compress multiple signals
both along time and across signals
Partition Determination

Use the double-half-same size heuristic

Start with some initial batch size (say 100 data points)
For the next batch, run grouping and compression with 200, 100 and 50 data points
For 200, compare with two batches of size 100; whichever takes
less memory is chosen
Similarly for 50, compare two batches of size 50 with one batch of size 100
Memory taken = # segments + cluster delta

Cluster delta: every time the clusters change, we need to update the base
signals and the base-ratio signal relationships
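One possible reading of the double-half-same heuristic, as a sketch. The slide does not fix the order of the comparisons or how ties are broken, so both are assumptions here, as is the `memory(start, end)` callback, a hypothetical stand-in for running grouping and compression on points [start, end) and returning # segments + cluster delta. Bounds checks at the end of the data are omitted.

```python
def choose_batch_size(memory, start, size):
    """Pick the next batch size among 2 * size, size and size // 2.

    memory(a, b) is a caller-supplied (hypothetical) function returning the
    memory needed to compress data points a..b-1 as a single batch.
    Assumption: doubling is tried first, then halving; ties keep `size`.
    """
    half = size // 2
    same = memory(start, start + size)

    # Double: one batch of 2 * size versus two consecutive batches of size.
    two_batches = same + memory(start + size, start + 2 * size)
    if memory(start, start + 2 * size) < two_batches:
        return 2 * size

    # Half: two batches of size // 2 versus one batch of size.
    two_halves = memory(start, start + half) + memory(start + half, start + size)
    if two_halves < same:
        return half

    return size


# Toy cost models, only to exercise the three outcomes.
grows = choose_batch_size(lambda a, b: 5, start=0, size=100)              # fixed cost per batch
stays = choose_batch_size(lambda a, b: b - a, start=0, size=100)          # cost linear in length
shrinks = choose_batch_size(lambda a, b: (b - a) ** 2, start=0, size=100) # cost superlinear
```

With a fixed per-batch cost larger batches win, with a linear cost nothing changes, and with a superlinear cost smaller batches win.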
GAMPS Illustration
[Figure: signals 1 to 5 go through three steps: partition; grouping (similar signals together); selection of base signals and ratio signals]
GAMPS Compression Illustration
[Figure: signals 1 to 5 go through three steps: partition (to overcome varying correlations); grouping (similar signals together); compression by amplitude scaling]