Things about Trace Analysis
Wei-jen Hsu
In-class presentation for CIS6930
wjhsu@ufl.edu (Advisor: Ahmed Helmy)

Objective
• More background knowledge related to trace-based study
• Details about the trace format (an intro for one of the assignments)
• Share the experience in trace analysis

Why trace analysis?
• Traces provide the “realism” of how the system works
– Verification of an established system
– Diagnosis of system operation (identifying faults)
– Identifying design flaws
– Large-scale properties (e.g. self-similar traffic)
– Understanding how a new system works
– Providing domain knowledge for analysis work
– Verifying an idea

Typical Work Flow for Trace Analysis
1. Build the system
2. Identify point(s) of trace collection and the methodology used
3. Obtain the data
4. Clean-up and sanity check
5. Analyze the data and post-process
6. Explain the results
7. Apply the results to further study or modify the existing system

WLAN Traces Study
• It started around 2000
– WLAN was new, and people wanted to understand how it was used (usage study)
– Surveys vs. traces
– Work by Tang and Baker (’00) and Kotz and Essien (’02) are pioneering examples
• Statistics of usage (# of users, amount of traffic, etc.)

WLAN Traces Study
• Mobility-related
– MIT work (home location, prevalence, and persistence)
– UCSD (PDA users)
– WLAN mobility model (INFOCOM05, T-model, T++-model)
• Other user properties
– Handoff
– Pause time distribution

Trace Format
• For association
– Usually with format (Node_id, start_time, location, end_time)
– But there are various ways to get there:
• Syslog: event-based
• SNMP: polling
• USC raw trace
– Wireless association (time start/stop switchport MAC)
– DHCP log (time IP MAC)
– Traffic log

Trace Format Example
• USC wireless association trace (Time Start/Stop Switch_IP Switch_port MAC_of_node)
    Mon Oct 10 01:16:52 Start 172.16.8.245 31005 0:30:65:f9:c0:ae
    Mon Oct 10 01:17:00 Stop  172.16.8.245 21044 0:e:35:99:64:d1
    Mon Oct 10 01:17:02 Start 172.16.8.245 31015 0:11:24:df:c0:3a
• USC DHCP trace (Time IP_of_node MAC_of_node)
    Jan 27 00:21:19 207.151.229.50  0:18:f3:10:ea:4c
    Jan 27 00:21:20 207.151.232.184 0:18:de:33:7:92
    Jan 27 00:21:20 207.151.229.50  0:18:f3:10:ea:4c
• USC traffic trace (Start_time End_time Destination_IP_port Source_IP_port protocol(TCP=6, UDP=17) “?” Packet_number Data_size)
    0127.23:59:42.925 0127.23:59:44.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368
    0127.23:59:42.925 0127.23:59:52.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192

Work with the Trace
• An exercise: “Does the Encounter-Relationship graph change with respect to time?”
• From WLAN traces, we find “encounters” to measure inter-node relationships
– Note: is this a good assumption?

Encounter distribution
• How many other nodes does a node encounter?
[Figure: encounter distribution; axis label: Prob. (unique encounter fraction > x)]
– Not many for WLAN users: on average only 2%~7% of the population

Encounter-Relationship graph
• Imagine a link connecting each pair of nodes that have ever encountered each other. What does the graph look like?
[Figure: sketch of an ER graph with a loner, a group of good friends, and cliques joined by random links]
• But is the ER graph a connected graph? What are its properties?

Encounter-Relationship graph
• To our surprise, ER graphs are connected!
[Figure: Disconnected Ratio (%)]
– In most cases the disconnected ratio (DR) reaches close to its final value in less than 1 day.

Encounter-Relationship graph
• What are the graph properties of the relationship graphs?
[Figure: comparison of three graph types]
– Regular graph: high path length, high clustering
– Small-world graph: high clustering like a regular graph, low path length like a random graph
– Random graph: low path length, low clustering

Encounter-Relationship graph
• Relationship graphs are small-world graphs
– High clustering coefficient, low avg. path length
[Figure: normalized clustering coefficient (CC) and path length (PL)]

Work with the Trace
• An exercise: “Does the Encounter-Relationship graph change with respect to time?”
– Chop the trace into multiple segments
– Analyze the average clustering coefficient and average path length of the resultant graph
– How to deal with a changing population?
– Does the encounter duration matter?

Work with the Trace
• Ask questions! What to look for in the trace?
– Its importance
– Its implication
– Its potential usage
– Its alternative solutions
• Apply new techniques to look into the data
• Find/create interesting data sets

Lessons Learned
• You need a lot of patience and care
– Exceptions in the data
– Flaws in your assumptions
• You need a lot of hard-drive space too!
• You need good questions
– For each question there are multiple ways to come up with an answer
– New questions require new data sets and tools
• You need to read a lot of papers

More Potential Directions
• Mobility modeling/prediction
• Data mining and clustering
• Behavior-aware service/advertisements
• Behavior-aware routing
– Caveat: over-generalization from WLAN to futuristic networks (such as DTN)?
• Re-examine assumptions in earlier work

Related Skills
• General programming (C/C++)
• Perl/shell script/awk
• Matrix manipulation (MATLAB)
• Statistics software (R)
– http://www.r-project.org/
• Clustering/machine learning
• Principal component analysis/singular value decomposition
– http://www.cs.cmu.edu/~elaw/papers/pca.pdf
• Data mining? Database analysis?
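
As a possible starting point for the encounter-graph exercise (chop the trace into segments, then compute the average clustering coefficient and average path length per segment), here is a minimal Python sketch. It assumes association records in the (Node_id, start_time, location, end_time) format with numeric timestamps, and it assumes that two nodes "encounter" each other when they are at the same location during overlapping intervals; the slides do not spell out that definition. All function names are hypothetical, and assigning each session to a segment by its start time is a simplification.

```python
from collections import deque
from itertools import combinations


def build_encounter_graph(sessions):
    """sessions: list of (node_id, start_time, location, end_time) tuples."""
    by_location = {}
    for node, start, loc, end in sessions:
        by_location.setdefault(loc, []).append((node, start, end))
    graph = {node: set() for node, _, _, _ in sessions}
    for visits in by_location.values():
        for (a, s1, e1), (b, s2, e2) in combinations(visits, 2):
            # Assumed encounter rule: same location, overlapping intervals.
            if a != b and s1 < e2 and s2 < e1:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def avg_clustering_coefficient(graph):
    """Mean over all nodes of (links among neighbors) / (possible links)."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than 2 neighbors contribute 0
        links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def avg_path_length(graph):
    """Mean hop count over connected ordered node pairs (BFS per node)."""
    total, pairs = 0, 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def graphs_per_segment(sessions, segment_length):
    """Chop the trace by session start time; build one graph per segment."""
    segments = {}
    for rec in sessions:
        segments.setdefault(int(rec[1] // segment_length), []).append(rec)
    return {i: build_encounter_graph(r) for i, r in sorted(segments.items())}


# Made-up toy trace, for illustration only.
sessions = [
    ("A", 0, "ap1", 10),
    ("B", 5, "ap1", 15),
    ("C", 8, "ap1", 20),
    ("C", 30, "ap2", 40),
    ("D", 35, "ap2", 45),
]
g = build_encounter_graph(sessions)
for idx, seg in graphs_per_segment(sessions, 25).items():
    print(idx, avg_clustering_coefficient(seg), avg_path_length(seg))
```

Comparing the per-segment values against each other (and against a random graph with the same number of nodes and links) is one way to approach the "does it change with time?" question. The changing population across segments still has to be handled separately.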
Good Online Resources
• MobiLib http://nile.cise.ufl.edu/MobiLib
– Links to various traces, USC trace and some processing tools download
• CRAWDAD http://crawdad.cs.dartmouth.edu/
– Various traces download, related papers

References
• [Stanford] D. Tang and M. Baker, “Analysis of a Local-area Wireless Network”
• [Stanford2] D. Tang and M. Baker, “Analysis of a Metropolitan-area Wireless Network”
• [Dartmouth] D. Kotz and K. Essien, “Analysis of a Campus-wide Wireless Network”
• [Dartmouth2] T. Henderson, D. Kotz, and I. Abyzov, “The Changing Usage of a Mature Campus-wide Wireless Network”
• [MIT/IBM] M. Balazinska and P. Castro, “Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network”
• [UCSD] M. McNett and G. Voelker, “Access and Mobility of Wireless PDA Users”
• [UCLA] X. Meng, S. Wong, Y. Yuan, and S. Lu, “Characterizing Flows in Large Wireless Data Networks”
• [USC] D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, “Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus”
• [USC2] K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, “Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks”
• [Dartmouth] M. Kim and D. Kotz, “Methodology for Classifying Mobile Users and Access Points”
• [Dartmouth] L. Song, D. Kotz, R. Jain, and X. He, “Evaluating location predictors with extensive Wi-Fi mobility data”
• [SIGCOMM01] A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, “Characterizing User Behavior and Network Performance in a Public Wireless LAN”
• [INFOCOM05] C. Tuduce and T.
Gross, “A Mobility Model Based on WLAN Traces and its Validation”
• [T++-model] D. Lelescu, U. C. Kozat, R. Jain, and M. Balakrishnan, “Model T++: an empirical joint space-time registration model”
• [T-model] R. Jain, D. Lelescu, and M. Balakrishnan, “Model T: an empirical model for user registration patterns in a campus wireless LAN”

More on Mobility Modeling

Mobility Observations from WLANs
• Skewed location visiting preferences
– Nodes spend 95% of time at top 5 preferred locations.
– Heavily visited “preferred spots”
• Periodical re-appearance
– Nodes show up repeatedly at the same location after integer multiples of days.
– Periodical “daily/weekly schedules”

Mobility Observations from WLANs
• Problems of simple random models (random walk, random waypoint, random direction)
– No preferred locations in spatial domain (uniform nodal distribution across space)
– No structure in time domain (homogeneous behavior across time)
– Nodes behave statistically identically to one another
• Benefit: mathematical tractability
• Can we improve realism without sacrificing mathematical tractability?

Time-variant Community Model
• Skewed location visiting preferences
– Create “communities” to be the preferred destinations
– Each node can have its own community
• Periodical re-appearance
– Create structure in time: periods
– Nodes move with different parameters in each period
– Repetitive structure
[Figure: community model illustration with 75% / 25% labels]

Time-variant Community Model
[Figure: average fraction of online time vs. APs sorted by total amount of time associated with them (two panels, linear and log scale); curves: MIT-trace, Model-simplified]
[Figure: probability of re-appearance vs. time gap (days); curves: MIT-trace, Model-simplified]
• Major trends of mobility characteristics preserved (extensions later)
• In addition, mathematical tractability is retained

More on Matrix-based Analysis

Introduction
• Wide-spread WLAN deployments create large-scale infrastructures.
– A large number of users leads to large-scale management and design issues.
• We need methods to quantify, summarize, and compare long-run trends (on the order of months) of individual user associations
– Usage model / association model
– Personalized services
– Behavior-aware ads / monetization
– Behavior-aware routing protocols

Questions
• Q1. How to quantify user association consistency?
– (Challenge) What is a proper representation of user association, and how do we measure consistency?
• Q2. How do we summarize long-run user association patterns?
– (Challenge) How to utilize existing data reduction techniques?
• Q3. How to group users with similar association patterns?
– (Challenge) How to quantify the similarity of user association patterns?
– How to reduce computational complexity?
• Contribution: generic methods to address these questions, empirically validated using USC and Dartmouth WLAN traces.

Representation of User Association Patterns
• Example: sessions (library, 1:30PM-2:30PM), (office, 10AM-12PM), (class, 6PM-8PM) give the association vector (library, office, class) = (0.2, 0.4, 0.4)
• We choose to represent the summary of user association in each day by a single vector.
• For a given day d, the association vector of user i is an n-element vector $a$ whose element $a_j$ is the fraction of online time user i spends at AP j on day d.
– The elements of a vector sum to 1.
$a = (a_1, a_2, \ldots, a_n)$
– Use the zero vector for off-line users.
• The elements in the vectors quantify the relative importance (or attraction) of the AP to the user.

Q1. User Association Consistency
• User i is consistent if its daily association vectors can be grouped into few clusters (e.g., less than 10% of the number of days).
• Evaluation: use hierarchical clustering with the Manhattan distance measure (L1)
– $D(a, b) = \sum_{i=1}^{n} |a_i - b_i|$
– Distance between two vectors is at most 2 (the elements of each vector are non-negative and sum to at most 1).

Q1. User Association Consistency
• Hierarchical clustering
– Start: each vector is a single-member cluster.
– Recursion: the two closest clusters are merged.
– End: stop when all remaining clusters have distances larger than a threshold.

Q1. User Association Consistency
[Figure: distribution of the number of clusters under cut-off threshold 0.9]
• 80% of users show at most 9 clusters of “behavior modes” during the 94-day trace
• Observation: many users are multi-modal, but with far fewer association modes than the total number of days in the trace period.
*complete link: distance between clusters = distance between the furthest components in the considered clusters

Q2. Summarizing user associations
• Association matrix: concatenate user association vectors for all days into a matrix.
[Figure: association matrix built from daily association vectors]
• To summarize, perform SVD and store the top-k singular values/vectors.
• What value of k do we have to use for a good representation of the matrix?
– Captured matrix power $= \sum_{i=1}^{k} \sigma_i^2 \,/\, \sum_{i=1}^{d} \sigma_i^2$, where $\sigma_i$ is the i-th singular value
• How much is the reconstruction error?
– Matrix norms $\|X - X_k\|_p / \|X\|_p$, where $\|X\|_p = \sqrt[p]{\sum_{(i,j)} |X(i,j)|^p}$

Q2. Summarizing user associations
• Only the top 6 singular vectors are needed to capture at least 90% of power for more than 95% of association matrices
• Reconstruction error of the low-rank approximation is low (5 singular vectors give error < 0.05)
• Observation: although users are multi-modal, a few major modes dominate their behavior

Q3. Similarity Metrics between Users
• Naive method to compare similarity between users i and j:
– Intuition: for every daily association vector of i, if there is a similar association vector for j, then (i, j) have similar behavior.
– From user i, pick the association vector $a_i^d$ of user i on day d.
– Find the association vector of user j, denoted by $a_j^{d'}$, which is the nearest to $a_i^d$.
– Find the average of $|a_j^{d'} - a_i^d|$ over all days d.
• Drawback: expensive
– $O(nd^2)$ for each pair
– Lots of file reads for a large dataset (must read the raw data)
• Need a faster method which reads summaries

Q3. Similarity Metrics between Users
• Compare the similarity of the eigenvectors obtained from SVD.
• Similarity between users is determined by weighted inner products of eigenvectors.
– $Sim(U, V) = \sum_{i,j} w_i w_j \, |u_i \cdot v_j|$
– $w_i$ = proportion of power of the i-th singular vector
– $D(U, V) = 1 - Sim(U, V)$
• Are the 2 metrics similar?
– 0.911 correlation coefficient for studied users.

Q3. Similarity Metrics between Users
• Are we able to get clusters with similar users?
• Compare the PDF/CDF for inter- and intra-cluster users (example: 200 clusters).

Q3.
Similarity Metrics between Users
• Take users in the same cluster, concatenate their association matrices, perform SVD, and find the power captured by the top-k eigenvectors.
• Also take random users and concatenate the eigenvectors and do the same.
• There is a clear distinction between the 2 clustering methods.
*straight-forward = similarity decided based on pair-wise comparison of association vectors
*feature-based = similarity decided based on singular vectors

Q3. Similarity Metrics between Users
• For all clusters, use a scatter plot to show the power captured by the top-4 eigenvectors (distance-based cluster vs. random cluster).
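
To make the two Q3 metrics concrete, here is a minimal Python sketch. The first function follows the straight-forward method as described: for each daily association vector of user i, find the nearest daily vector of user j in L1 distance, then average. The second assumes the feature-based similarity is a double sum of weighted absolute inner products over each user's top singular vectors; the absolute value and the exact index ranges are assumptions about the slide's formula. The singular vectors and their power weights are assumed to be precomputed with an SVD routine of your choice. All names are hypothetical.

```python
def l1_distance(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def naive_distance(days_i, days_j):
    """Straight-forward metric: for each daily vector of user i, take the
    L1 distance to the nearest daily vector of user j, then average over
    user i's days. Note this is not symmetric in i and j."""
    nearest = [min(l1_distance(a, b) for b in days_j) for a in days_i]
    return sum(nearest) / len(days_i)


def feature_similarity(vecs_u, weights_u, vecs_v, weights_v):
    """Feature-based metric (assumed form):
    Sim(U, V) = sum_i sum_j w_i * w_j * |u_i . v_j|
    vecs_*: unit-length singular vectors kept for each user
    weights_*: proportion of matrix power captured by each vector"""
    sim = 0.0
    for u, wu in zip(vecs_u, weights_u):
        for v, wv in zip(vecs_v, weights_v):
            dot = sum(x * y for x, y in zip(u, v))
            sim += wu * wv * abs(dot)
    return sim


# Made-up two-AP example, for illustration only.
days_i = [(1.0, 0.0), (0.5, 0.5)]
days_j = [(1.0, 0.0), (0.0, 1.0)]
print(naive_distance(days_i, days_j))

# Two users whose single dominant mode points at different APs.
print(1.0 - feature_similarity([(1.0, 0.0)], [1.0], [(0.0, 1.0)], [1.0]))
```

The naive metric reads every daily vector of both users for each pair, which is where the cost comes from. The feature-based metric only touches the few stored vectors and weights per user, which is the point of summarizing with SVD first.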