Things about Trace Analysis

Wei-jen Hsu
In class presentation for CIS6930
wjhsu@ufl.edu
(Advisor: Ahmed Helmy)
Objective
• More background knowledge related to
trace-based study
• Details about the trace format – an intro
for one of the assignments
• Share the experience in trace analysis
Why trace analysis?
• Traces provide the “realism” of how the
system works
– Verification of established system
– Diagnosis of system operation (identify faults)
– Identifying design flaws
– Large-scale properties (e.g. self-similar traffic)
– Understand how a new system works
– Provide domain knowledge for analysis work
– Verifying an idea
Typical Work Flow for Trace Analysis
1. Build the system
2. Identify point(s) of trace collection and the
methodology used
3. Obtain the data
4. Clean-up and sanity check
5. Analyze the data and post processing
6. Explain the results
7. Apply the results to further study or modify the
existing system
WLAN Traces Study
• It started around 2000
– WLAN was new, and researchers wanted to
understand how people used it (usage
study)
– Surveys vs. traces
– Work by Tang and Baker (’00) and Kotz and
Essien (’02) are pioneering examples
• Statistics of usage (# of users, amount of
traffic, etc.)
WLAN Traces Study
• Mobility-related
– MIT work (home location, prevalence, and
persistence)
– UCSD (PDA users)
– WLAN mobility model (INFOCOM05, T-model,
T++-model)
• Other user properties
– Handoff
– Pause time distribution
Trace Format
• For association
– Usually with format
(Node_id, start_time, location, end_time)
– But with various ways to get you there….
• Syslog: Event-based
• SNMP: Polling
• USC raw trace
– Wireless association (time start/stop switchport MAC)
– DHCP log (time MAC IP)
– Traffic log
Trace Format Example
• USC wireless association trace
(Time Start/Stop Switch_IP Switch_port MAC_of_node)
Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae
Mon Oct 10 01:17:00 Stop 172.16.8.245 21044 0:e:35:99:64:d1
Mon Oct 10 01:17:02 Start 172.16.8.245 31015 0:11:24:df:c0:3a
• USC DHCP trace
(Time IP_of_node MAC_of_node)
Jan 27 00:21:19 207.151.229.50 0:18:f3:10:ea:4c
Jan 27 00:21:20 207.151.232.184 0:18:de:33:7:92
Jan 27 00:21:20 207.151.229.50 0:18:f3:10:ea:4c
• USC traffic trace
(Start_time End_time Destination_IP_port Source_IP_port protocol(TCP=6, UDP=17) “?” Packet_number Data_size)
0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368
0127.23:59:42.925 0127.23:59:52.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
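A minimal sketch of turning association-trace lines into (Node_id, start_time, location, end_time) records. It assumes whitespace-separated fields in the order shown for the USC association trace, treats the (Switch_IP, Switch_port) pair as the location, and pairs each Start with the next Stop of the same MAC. The helper names and the `year` default are hypothetical (the log lines carry no year), and the third sample line is made up for illustration.

```python
from datetime import datetime

def parse_line(line, year=2005):
    """Parse one association line. `year` is an assumption: the log omits it."""
    _dow, mon, day, clock, event, switch_ip, port, mac = line.split()
    when = datetime.strptime(f"{year} {mon} {day} {clock}", "%Y %b %d %H:%M:%S")
    return when, event, (switch_ip, port), mac

def to_sessions(lines, year=2005):
    """Pair each Start with the next Stop of the same MAC address."""
    open_sessions = {}  # mac -> (start_time, location)
    sessions = []
    for line in lines:
        if not line.strip():
            continue
        when, event, location, mac = parse_line(line, year)
        if event == "Start":
            open_sessions[mac] = (when, location)
        elif event == "Stop" and mac in open_sessions:
            start, start_location = open_sessions.pop(mac)
            sessions.append((mac, start, start_location, when))
        # A Stop without a matching Start is dropped: one of the
        # "exceptions in the data" that the clean-up step has to handle.
    return sessions

sample = [
    "Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae",
    "Mon Oct 10 01:17:00 Stop 172.16.8.245 21044 0:e:35:99:64:d1",
    "Mon Oct 10 01:20:52 Stop 172.16.8.245 31005 0:30:65:f9:c0:ae",  # made-up line
]
sessions = to_sessions(sample)
for mac, start, location, end in sessions:
    print(mac, location, (end - start).total_seconds())
```

Keeping the raw Start/Stop events separate from the paired sessions makes it easier to count how many records the sanity check discards.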
Work with the Trace
• An exercise:
“Does the Encounter-Relationship graph
change with respect to time??”
• From WLAN traces,
we find “encounters” to measure inter-node relationships
Note: Is this a good assumption??
Encounter distribution
• How many other nodes does a node encounter?
[Figure: Prob. (unique encounter fraction > x)]
Not many for WLAN users. On average, only 2% to 7% of the population.
Encounter-Relationship graph
• Imagine a link connecting every pair of
nodes that have ever encountered each
other … What does the graph look like?
loner
Group of good friends…
But is the ER graph
a connected graph?
What are its properties?
Cliques with random links to join them
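A minimal sketch of building an Encounter-Relationship graph and checking whether it is connected. The slides do not spell out the encounter rule, so this assumes that two nodes encounter each other when their association intervals at the same location overlap in time. Sessions are (node, start, location, end) tuples with numeric times, and all function names are illustrative.

```python
from collections import defaultdict, deque
from itertools import combinations

def find_encounters(sessions):
    """Return the set of node pairs whose sessions overlap at the same location."""
    by_location = defaultdict(list)
    for node, start, location, end in sessions:
        by_location[location].append((node, start, end))
    pairs = set()
    for visits in by_location.values():
        # Quadratic per location; fine for a sketch, slow for busy APs.
        for (a, s1, e1), (b, s2, e2) in combinations(visits, 2):
            if a != b and s1 < e2 and s2 < e1:  # intervals overlap
                pairs.add(frozenset((a, b)))
    return pairs

def build_graph(nodes, pairs):
    """Adjacency sets; nodes with no encounters stay in the graph as loners."""
    graph = {n: set() for n in nodes}
    for pair in pairs:
        a, b = tuple(pair)
        graph[a].add(b)
        graph[b].add(a)
    return graph

def is_connected(graph):
    """Breadth-first search from an arbitrary node reaches every node."""
    if not graph:
        return True
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbor in graph[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(graph)

sessions = [("A", 0, "ap1", 10), ("B", 5, "ap1", 15),
            ("C", 12, "ap1", 20), ("D", 0, "ap2", 5)]
pairs = find_encounters(sessions)
graph = build_graph({s[0] for s in sessions}, pairs)
print(sorted(sorted(p) for p in pairs), is_connected(graph))
```

In this toy input A meets B and B meets C, while D is a loner, so the graph is not connected.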
Encounter-Relationship graph
• To our surprise, ER graphs are connected!!
[Figure: Disconnected Ratio (DR, %)]
In most cases the DR comes close to its final value in less than 1 day.
Encounter-Relationship graph
• What are the graph properties of the
relationship graphs?
Regular graph
- High path length
- High clustering
Small-world graph
- High clustering, as in a regular graph
- Low path length, as in a random graph
Random graph
- Low path length
- Low clustering
Encounter-Relationship graph
• Relationship graphs are small-world graphs
– High clustering coefficient, low avg. path
length
[Figure: normalized CC and PL]
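A minimal sketch of the two metrics named above, computed on a graph stored as a dict of adjacency sets. It assumes the average local clustering coefficient (nodes with fewer than two neighbors count as 0) and the mean shortest-path hop count over reachable pairs. The values are raw, not normalized against regular or random graphs as in the figure, and the function names are illustrative.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    total = 0.0
    for node, neighbors in graph.items():
        k = len(neighbors)
        if k < 2:
            continue  # contributes 0 to the average
        links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)

def average_path_length(graph):
    """Mean hop count over all reachable ordered pairs, using BFS from every node."""
    total, count = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count if count else 0.0

# A triangle (1, 2, 3) with one extra node hanging off node 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
cc = clustering_coefficient(toy)
pl = average_path_length(toy)
print(round(cc, 4), round(pl, 4))
```

BFS from every node costs O(V·(V+E)), so for a campus-scale graph it may be worth sampling source nodes instead.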
Work with the Trace
• An exercise:
“Does the Encounter-Relationship graph
change with respect to time??”
– Chop the trace into multiple segments
– Analyze the average clustering coefficient and
average path length of the resultant graph
– How to deal with changing population?
– Does the encounter duration matter?
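One possible way to chop the trace into segments, sketched under the assumption that sessions are (node, start, location, end) tuples with numeric times and that a session spanning a boundary is clipped into each window it touches. The function name and the fixed window length are illustrative choices, not part of the assignment.

```python
def split_into_segments(sessions, segment_length):
    """Map segment index -> list of sessions clipped to that time window."""
    segments = {}
    for node, start, location, end in sessions:
        index = int(start // segment_length)
        while index * segment_length < end:
            lo = max(start, index * segment_length)
            hi = min(end, (index + 1) * segment_length)
            if hi > lo:  # skip zero-length pieces
                segments.setdefault(index, []).append((node, lo, location, hi))
            index += 1
    return segments

sessions = [("A", 5, "ap1", 25), ("B", 0, "ap1", 10)]
segments = split_into_segments(sessions, 10)
for index in sorted(segments):
    print(index, segments[index])
```

Each segment can then be analyzed on its own. The set of nodes present differs from segment to segment, which is exactly the changing-population question raised above.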
Work with the Trace
• Ask questions! What to look for from the
trace?
– Its importance
– Its implication
– Its potential usage
– Its alternative solutions
• Apply new techniques to look into the data
• Find/Create interesting data sets
Lessons Learned
• You need a lot of patience and care
– Exceptions in the data
– Flaws in your assumption
• You need a lot of hard-drive space too!
• You need good questions
– For each question there are multiple ways to
come up with an answer
– New questions require new data sets and
tools
• You need to read a lot of papers
More Potential Directions
• Mobility modeling/prediction
• Data mining and clustering
• Behavior-aware service/advertisements
• Behavior-aware routing
– Caveat: Over-generalization from WLAN to
futuristic networks (such as DTN)?
• Re-examine assumptions in earlier work
Related Skills
• General programming (C/C++)
• Perl/shell script/awk
• Matrix manipulation (MATLAB)
• Statistics software (R)
– http://www.r-project.org/
• Clustering/Machine learning
• Principal component analysis/ Singular
value decomposition
– http://www.cs.cmu.edu/~elaw/papers/pca.pdf
• Data mining? Database analysis?
Good Online Resources
• MobiLib
http://nile.cise.ufl.edu/MobiLib
– Links to various traces, USC trace and some processing tools
download
• CRAWDAD
http://crawdad.cs.dartmouth.edu/
– Various traces download, related papers
References
• [Stanford] D. Tang and M. Baker, “Analysis of a
Local-area Wireless Network”
• [Stanford2] D. Tang and M. Baker, “Analysis of a
Metropolitan-area Wireless Network”
• [Dartmouth] D. Kotz and K. Essien, “Analysis of
a Campus-wide Wireless Network”
• [Dartmouth2] T. Henderson, D. Kotz, and I.
Abyzov, “The Changing Usage of a Mature
Campus-wide Wireless Network”
• [MIT/IBM] M. Balazinska and P. Castro,
“Characterizing Mobility and Network Usage in a
Corporate Wireless Local-area Network”
References
• [UCSD] M. McNett and G. Voelker, “Access and
Mobility of Wireless PDA Users”
• [UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu,
“Characterizing Flows in Large Wireless Data
Networks”
• [USC] D. Bhattacharjee, A. Rao, C. Shah, M.
Shah, and A. Helmy, “Empirical Modeling of
Campus-wide Pedestrian Mobility: Observations
on the USC Campus”
• [USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu,
and A. Helmy, “Weighted Waypoint Mobility
Model and Its Impacts on Ad Hoc Networks”
References
• [Dartmouth] M. Kim and D. Kotz, “Methodology for
Classifying Mobile Users and Access Points”
• [Dartmouth] L. Song, D. Kotz, R. Jain, and X. He,
“Evaluating Location Predictors with Extensive Wi-Fi
Mobility Data”
• [SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and
V. Rangan, “Characterizing User Behavior and Network
Performance in a Public Wireless LAN”
• [INFOCOM05] C. Tuduce and T. Gross, “A Mobility
Model Based on WLAN Traces and its Validation”
• [T++-model] D. Lelescu, U. C. Kozat, R. Jain, and M.
Balakrishnan, “Model T++: An Empirical Joint Space-Time
Registration Model”
• [T-model] R. Jain, D. Lelescu, and M. Balakrishnan, “Model T:
An Empirical Model for User Registration Patterns in a
Campus Wireless LAN”
More on Mobility Modeling
Mobility Observations from WLANs
• Skewed location visiting preferences
– Nodes spend 95% of time at top 5 preferred locations.
– Heavily visited “preferred spots”
• Periodical reappearance
– Nodes show up repeatedly at the same location after integer multiples of days.
– Periodical “daily/weekly schedules”
Mobility Observations from WLANs
• Problems of simple random models (random
walk, random waypoint, random direction)
– No preferred locations in spatial domain (uniform
nodal distribution across space)
– No structure in time domain (homogeneous behavior
across time)
– Nodes behave statistically identically to one another
• Benefit: Math analysis tractability
• Can we improve realism and not sacrifice math
tractability?
Time-variant Community Model
• Skewed location visiting preferences
– Create “communities” to be the preferred
destination
– Each node can have its own community
• Periodical re-appearance
– Create structure in time: periods
– Nodes move with different
parameters in different periods
– Repetitive structure
[Figure: model illustration; labels 75% and 25%]
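A toy sketch of the two ideas on this slide: a per-node community of preferred destinations, and a repeating sequence of periods with different parameters. It is not the published model. It assumes one parameter per period, the probability of picking the next destination inside the community. The location names and probabilities are hypothetical.

```python
import random

def simulate_visits(community, all_locations, stay_prob_by_period,
                    steps_per_period, cycles, seed=0):
    """Return (period, location) visits for one node over repeated periods."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    outside = [loc for loc in all_locations if loc not in community]
    visits = []
    for _ in range(cycles):                      # repetitive structure
        for period, p_stay in enumerate(stay_prob_by_period):
            for _ in range(steps_per_period):
                in_community = rng.random() < p_stay or not outside
                pool = community if in_community else outside
                visits.append((period, rng.choice(pool)))
    return visits

community = ["library", "office"]                # hypothetical preferred spots
campus = ["library", "office", "gym", "cafe", "dorm"]
visits = simulate_visits(community, campus, [0.9, 0.3],
                         steps_per_period=50, cycles=3)
share = sum(loc in community for _, loc in visits) / len(visits)
print(f"fraction of visits inside the community: {share:.2f}")
```

Because every node draws from its own community with its own per-period parameters, visiting preferences become skewed in space and structured in time while each individual step stays a simple random draw.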
Time-variant Community Model
• Major trends of mobility characteristics
preserved (extensions later)
[Figure: Avg. fraction of online time vs. AP sorted by total amount of time associated with it; MIT-trace vs. Model-simplified]
[Figure: Prob. of re-appearance vs. time gap (days); MIT-trace vs. Model-simplified]
• In addition, mathematical tractability is retained
More on Matrix-based Analysis
Introduction
• Widespread WLAN deployments create large-scale infrastructures.
– Large numbers of users lead to large-scale
management and design issues.
• We need methods to quantify, summarize, and
compare long-run trends (on the order of months)
of individual user associations
– Usage model / association model
– Personalized services
– Behavior-aware ads / monetization
– Behavior-aware routing protocols
Questions
• Q1. How to quantify user association consistency?
– (Challenge) What is a proper representation of user association, and
how do we measure consistency?
• Q2. How do we summarize long run user association
patterns?
– (Challenge) How to utilize existing data reduction techniques?
• Q3. How to group users with similar association patterns?
– (Challenge) How to quantify the similarity of user association
patterns?
– How to reduce computational complexity?
• Contribution: Generic methods to address these questions,
empirically validated using USC and Dartmouth WLAN
traces.
Representation of User Association Patterns
(library, 1:30PM-2:30PM)
(office, 10AM-12PM)
(class, 6PM-8PM)
Association vector:
(library, office, class) =(0.2, 0.4, 0.4)
• We choose to represent the summary of a user's associations in each day by
a single vector.
• For a given day d, the association vector of user i is an n-element
vector $a = (a_1, a_2, \dots, a_n)$, where $a_j$ is the fraction of online time user i spends at AP j
on day d.
– The elements of a vector sum to 1.
– Use the zero vector for off-line users.
• The elements in the vector quantify the relative importance (or
attraction) of each AP to the user.
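The association vector can be sketched as follows, assuming one user's sessions for a single day are given as (AP, start, end) tuples with times in hours. The function name and input layout are illustrative. The example reproduces the (library, office, class) = (0.2, 0.4, 0.4) vector from this slide.

```python
def association_vector(day_sessions, ap_order):
    """Fraction of the day's online time spent at each AP, in `ap_order`."""
    time_at = {ap: 0.0 for ap in ap_order}
    for ap, start, end in day_sessions:
        time_at[ap] += end - start
    total = sum(time_at.values())
    if total == 0:
        return [0.0] * len(ap_order)  # zero vector for an off-line user
    return [time_at[ap] / total for ap in ap_order]

day = [("library", 13.5, 14.5),  # 1:30PM-2:30PM
       ("office", 10.0, 12.0),   # 10AM-12PM
       ("class", 18.0, 20.0)]    # 6PM-8PM
vector = association_vector(day, ["library", "office", "class"])
offline = association_vector([], ["library", "office", "class"])
print(vector, offline)
```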


Q1. User Association Consistency
• User i is consistent if its daily association
vectors can be grouped into a few clusters
(e.g., fewer than 10% of the number of days).
• Evaluation: use hierarchical clustering with the
Manhattan distance measure (L1)
– $D(a, b) = \sum_{i=1}^{n} |a_i - b_i|$
– Distance between two vectors is at most 2.
Q1. User Association Consistency
• Hierarchical Clustering
– Start: Each vector is a single-member cluster.
– Recursion: Two closest clusters are merged.
– End: Stop when all remaining clusters are farther
apart than a threshold
Q1. User Association Consistency
[Figure: distribution of the number of clusters under cut-off threshold 0.9]
80% of users show at most 9 clusters of “behavior modes”
during the 94-day trace
*complete link: distance between clusters =
distance between the furthest components in
the considered clusters
Observation: many users are multimodal, but with
far fewer association modes than the total number
of days in the trace period.
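A minimal sketch of the clustering procedure described for Q1: Manhattan distance between daily association vectors, complete-link distance between clusters, and merging until the closest pair is farther apart than the cut-off. The naive pair search is for illustration only, and the function names and toy vectors are assumptions.

```python
def manhattan(a, b):
    """L1 distance; at most 2 for two vectors that each sum to 1."""
    return sum(abs(x - y) for x, y in zip(a, b))

def complete_link(c1, c2):
    """Distance between the furthest members of the two clusters."""
    return max(manhattan(a, b) for a in c1 for b in c2)

def cluster(vectors, threshold):
    """Merge the two closest clusters until the closest pair exceeds `threshold`."""
    clusters = [[v] for v in vectors]  # start: one vector per cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

days = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0),   # two days mostly at AP 1
        (0.0, 0.0, 1.0), (0.0, 0.1, 0.9),   # two days mostly at AP 3
        (0.0, 0.0, 0.0)]                    # one off-line day
modes = cluster(days, threshold=0.9)
print(len(modes), [len(m) for m in modes])
```

The number of clusters that remains is the number of "behavior modes" for that user, which can then be compared with the number of days in the trace.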
Q2. Summarizing user associations
• Association matrix: concatenate user association
vectors for all days into a matrix $X$.
• To summarize, perform SVD and store the top-k
singular values/vectors.
• What value of k do we need for a good
representation of the matrix?
– Captured matrix power $= \sum_{i=1}^{k} \sigma_i^2 \,/\, \sum_{i=1}^{d} \sigma_i^2$, where $\sigma_i$ is the i-th singular value
• How much is the reconstruction error?
– Matrix norms $\|X - X_k\|_p / \|X\|_p$,
where $\|X\|_p = \left( \sum_{(i,j)} |X(i,j)|^p \right)^{1/p}$
[Figure label: daily association vector]
Q2. Summarizing user associations
Only the top 6 singular vectors
are needed to capture at least
90% of power for more than
95% of association matrices
Reconstruction error of
low-rank approximation
is low (5 singular vectors
give error < 0.05)
Observation: although users are multi-modal,
a few major modes dominate their behavior
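A small sketch of the captured-power calculation, taking the singular values as given. They could come from any SVD routine, for example `numpy.linalg.svd(X, compute_uv=False)`. The function names and the example singular values are hypothetical.

```python
def captured_power(singular_values, k):
    """Sum of the k largest squared singular values over the sum of all of them."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squares)
    return sum(squares[:k]) / total if total else 0.0

def smallest_k(singular_values, target=0.9):
    """Smallest k whose captured power reaches `target`."""
    for k in range(1, len(singular_values) + 1):
        if captured_power(singular_values, k) >= target:
            return k
    return len(singular_values)

sigma = [3.0, 1.0, 0.5]  # made-up singular values of one association matrix
for k in range(1, len(sigma) + 1):
    print(k, round(captured_power(sigma, k), 4))
print("k for 90% power:", smallest_k(sigma, 0.9))
```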
Q3. Similarity Metrics between Users
• Naive method to compare similarity between
users i and j:
– Intuition: if for every daily association vector of i there
is a similar association vector of j, then (i, j) have
similar behavior.
– For user i, pick the association vector $a_i^d$ on
day d.
– Find the association vector of user j, denoted by $a_j^{d'}$,
that is nearest to $a_i^d$.
• Find the average of $|a_j^{d'} - a_i^d|$ over all days d.
• Drawback: expensive
– O(nd^2) for each pair
– Lots of file reads of raw data for a large dataset
• Need a faster method that reads summaries
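The naive comparison can be sketched as below, assuming each user is a list of daily association vectors and that the Manhattan (L1) distance is used to find the nearest vector. The names and toy data are illustrative. The two nested loops over days, each costing O(n) per distance, give the O(nd^2) cost per pair.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def naive_distance(days_i, days_j):
    """Average, over user i's days, of the distance to user j's nearest daily vector."""
    total = 0.0
    for a in days_i:                                     # d vectors of user i
        total += min(manhattan(a, b) for b in days_j)    # d vectors of user j
    return total / len(days_i)

user_i = [(1.0, 0.0), (0.0, 1.0)]
user_j = [(1.0, 0.0), (0.5, 0.5)]
dist_ij = naive_distance(user_i, user_j)
print(dist_ij)
```

Note that this measure is not symmetric in general: swapping the two users can change the result, since the nearest-vector search runs from one side only.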
Q3. Similarity Metrics between Users
• Compare the similarity of the singular vectors obtained from SVD.
• Similarity between users is determined by
weighted inner products of singular vectors.
– $\mathrm{Sim}(U, V) = \sum_{i,j} w_i w_j \, |u_i \cdot v_j|$
– $w_i$ = proportion of power of singular vector i
– $D(U, V) = 1 - \mathrm{Sim}(U, V)$
• Are the 2 metrics similar?
– 0.911 correlation coefficient for studied users.
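A sketch of the feature-based similarity, assuming each user is summarized by a few singular vectors with one weight per vector (the proportion of power it captures). Taking the absolute value of each inner product is an assumption made here because the sign of a singular vector is arbitrary. The function names and toy inputs are illustrative.

```python
def similarity(U, weights_u, V, weights_v):
    """Weighted sum of |u_i . v_j| over all pairs of singular vectors."""
    total = 0.0
    for u, w_u in zip(U, weights_u):
        for v, w_v in zip(V, weights_v):
            dot = sum(x * y for x, y in zip(u, v))
            total += w_u * w_v * abs(dot)
    return total

def distance(U, weights_u, V, weights_v):
    return 1.0 - similarity(U, weights_u, V, weights_v)

U = [(1.0, 0.0), (0.0, 1.0)]   # user 1: dominant mode at AP 1
V = [(0.0, 1.0), (1.0, 0.0)]   # user 2: dominant mode at AP 2
w = [0.8, 0.2]                 # made-up power proportions
sim_same = similarity(U, w, U, w)
sim_swapped = similarity(U, w, V, w)
print(round(sim_same, 4), round(sim_swapped, 4))
```

Only the stored top-k vectors and weights are read, so the cost per pair no longer depends on the number of days in the trace.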
Q3. Similarity Metrics between Users
• Are we able to get clusters with similar
users?
• Compare the PDF/CDF for inter- and intra-cluster users (Example: 200 clusters).
Q3. Similarity Metrics between Users
• Take users in the same
cluster, concatenate their
association matrices, perform
SVD, and find the power captured
by the top k singular vectors.
• Also take random users,
concatenate the eigenvectors,
and do the same.
• There is a clear distinction
between the 2 clustering
methods.
*straight-forward = similarity decided based on
pair-wise comparison of association vectors
*feature-based = similarity decided based on
singular vectors
Q3. Similarity Metrics between Users
• For all clusters, use a scatter plot to show
the power captured by the top-4 singular vectors
(distance-based cluster vs. random cluster).