Technion - Israel Institute of Technology
Faculty of Electrical Engineering
Control, Robotics and Machine Learning Laboratory

Project Report: Project A

User Behavior Analysis in Wi-Fi Network

By: Anna Rosenberg
Supervisor: Orly Avner
Date: Spring Semester 2012

Contents

Abstract
Literature Review
Data
Access Points Analysis
  IEEE 802.11 Architecture
  Arrival Rate
Users
  Visit duration
  Features
  Clustering
    Possible Applications
    Distance Measure
  K-Means Clustering
  K-Means Results
  G-Means Algorithm
  Evaluation of clustering
Conclusions
Bibliography

Figure 1 - Arrival rate of AP 1
Figure 2 - Arrival rate of AP 2
Figure 3 - Arrival rate of AP 3
Figure 4 - Arrival rate of AP 4
Figure 5 - Arrival rate of AP 5
Figure 6 - Arrival rate of AP 6
Figure 7 - Arrival rate of AP 7
Figure 8 - Arrival rate of AP 8
Figure 9 - Arrival rate of AP 9
Figure 10 - Arrival rate of AP 10
Figure 11 - Arrival rate of AP 11
Figure 12 - Arrival rate of AP 12
Figure 13 - Arrival rate of AP 13
Figure 14 - Arrival rate of AP 14
Figure 15 - Arrival rate of AP 15
Figure 16 - Arrival rate of AP 16
Figure 17 - Arrival rate of AP 1 with averaging window of 0.1 hour
Figure 18 - Arrival rate of AP 1 with averaging window of 0.2 hour
Figure 19 - Arrival rate of AP 1 with averaging window of 0.25 hour
Figure 20 - Arrival rate of AP 1 with averaging window of 0.3 hour
Figure 21 - Arrival rate of AP 1 with averaging window of 0.35 hour
Figure 22 - Arrival rate of AP 1 with averaging window of 0.4 hour
Figure 23 - Arrival rate of AP 1 with averaging window of 0.5 hour
Figure 24 - Arrival rate of AP 1 with different averaging windows
Figure 25 - Transmission rate of user 1 during the day with averaging window of 1 hour
Figure 26 - Transmission rate of user 2 during the day with averaging window of 1 hour
Figure 27 - Transmission rate of user 3 during the day with averaging window of 1 hour
Figure 28 - Transmission rate of user 2 during the week with averaging window of 1 hour
Figure 29 - Transmission rate of user 3 during the week with averaging window of 1 hour
Figure 30 - Inter-arrival times of packets from 9:30 am till 11:18 am
Figure 31 - Inter-arrival times of packets from 12:30 pm till 4:53 pm
Figure 32 - Inter-arrival times of packets during the day
Figure 33 - Histogram of user's visits widespread
Figure 34 - Histogram of user's visits widespread
Figure 35 - Histogram of user's visits widespread
Figure 36 - Histogram of user's visits widespread
Figure 37 - Histogram of user's visits widespread
Figure 38 - Average inter-visits times vs. Average visit duration
Figure 39 - Average inter-visits times vs. Number of visits
Figure 40 - Average traffic vs. Average visit duration
Figure 41 - Average traffic vs. Number of visits
Figure 42 - Average visit duration vs. Number of visits
Figure 43 - Average inter-visits times vs. Total days in system
Figure 44 - Standard deviation of inter-arrival times of visits
Figure 45 - Standard deviation of traffic per packets
Figure 46 - Standard deviation of visit duration times
Figure 47 - k-means results when k=2: Average visit duration vs. Average inter-visits times
Figure 48 - k-means results when k=2: Average visit duration vs. Average inter-visits times vs. Average traffic per packets
Figure 49 - k-means results when k=2: Average inter-visits times vs. Average traffic per packets
Figure 50 - k-means results when k=2: Maximal distance between visits vs. Minimal distance between visits
Figure 51 - k-means results when k=3: Average visit duration vs. Average inter-visits times
Figure 52 - k-means results when k=3: Average inter-visits times vs. Average traffic per packets
Figure 53 - k-means results when k=3: Maximal distance between visits vs. Minimal distance between visits
Figure 54 - k-means results when k=4: Average visit duration vs. Average inter-visits times
Figure 55 - k-means results when k=4: Average inter-visits times vs. Average traffic per packets
Figure 56 - k-means results when k=4: Maximal distance between visits vs. Minimal distance between visits
Figure 57 - Number of clusters produced by the g-means algorithm vs. Significance level α
Figure 58 - User's Clusters produced by the g-means algorithm
Figure 59 - User's Clusters produced by the g-means algorithm
Figure 60 - User's Clusters produced by the g-means algorithm
Figure 61 - Evaluation measure Purity vs. significance level α
Figure 62 - Evaluation measure E vs. significance level α

Abstract

Wireless networks are increasingly being deployed and expanded in airports, universities, corporations, hospitals, residential areas, and other public areas to provide wireless Internet access. Modeling how wireless clients arrive at different APs, how long they stay at them, and how much data they access can be beneficial in capacity planning, administration and deployment of wireless infrastructures, protocol design for wireless applications and services, and their performance analysis. A better understanding of the arrival rate of clients at APs can also assist in forecasting the traffic demand at APs. Short-term forecasting (e.g., over a few minutes) can be employed in the design of more energy-efficient clients and of mechanisms for resource reservation and load balancing among APs. Long-term forecasting is essential for capacity planning and for understanding the evolution of wireless traffic and networks. Understanding and forecasting the access patterns at APs can have a dominant impact on the operation of wireless APs.

The goal of this project is to analyze the APs of a Wi-Fi network and to model the wireless clients using it. The contributions of this project are the following: the analysis of the Access Points of a wireless network, and the use of the k-means and g-means algorithms for clustering the network's users.

Literature Review

As the first step of the project, we reviewed previous related research in the field of network behavior studies.

1. Modeling client arrivals at access points in wireless campus-wide networks. (Maria Papadopouli, Haipeng Shen, Manolis Spanakis)

The goal of this study is to model the arrival of wireless clients at the access points (APs) in a production 802.11 infrastructure. The authors show that time-varying Poisson processes model the arrival processes of clients at APs well, and they validate this result by modeling the visit arrivals at different time intervals and APs. They investigate the traffic load characteristics (e.g., bytes, number of packets, associations, distinct clients, type of clients), their dependencies and their interplay at various time scales, from the perspective of both a client and an access point (AP).

The main contributions of this work are the following: a novel methodology for modeling the arrival processes of clients at wireless APs; the use of a very powerful visualization tool (the SiZer map) for finding detailed interior features, and of quantile plots with a simulation envelope for goodness-of-fit tests; and models of client arrivals at APs as time-varying Poisson processes with different arrival-rate functions. Furthermore, they investigate the impact of the type of building (i.e., its functionality) in which the AP is located on the arrival rate, and they cluster the visit arrival models based on the building type.

The conclusions of this work are that time-varying Poisson processes can model the arrival processes of clients at APs well, and that it is possible to cluster the APs based on their visit arrivals and the functionality of the area in which they are located.

2. Characterizing user behavior and network performance in a public wireless LAN. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems, 2002.
(Anand Balachandran, Geoffrey Voelker, Paramvir Bahl, and Venkat Rangan)

This paper extends previous studies by presenting and analyzing user behavior and network performance in a public-area wireless network, using a trace recorded over three days at the ACM SIGCOMM’01 conference held at U.C. San Diego in August 2001. The trace consists of two parts. The first part is a record of performance monitoring data sampled from the wireless access points (APs) serving the conference, and the second consists of anonymized packet headers of all wireless traffic. Both parts of the trace span the three days of the conference, capturing the workload of 300,000 flows from 195 users consuming 4.6 GB of bandwidth.

In this paper the arrival process is modeled as governed by an underlying Markov chain that is in one of two states, ON or OFF. In the OFF state there are no arrivals into the system, which typically happens midway into a conference session. During the ON state, arrivals vary randomly over time with a more or less constant arrival rate. The mean inter-arrival time during the ON state is 38 seconds. The mean duration of the OFF state is 6 minutes, with longer OFF periods during the session breaks and the lunch break. The authors also model the error rates and analyze MAC-level retransmissions.

Their overall analysis of user behavior shows that:

Users are evenly distributed across all APs, user arrivals are correlated in time and space, and user arrivals into the network can be modeled by a two-state Markov-Modulated Poisson Process (MMPP).

Most of the users have short session times: 60% of the user sessions last less than 10 minutes. Users with longer session times are idle for most of the session. The session time distribution can be approximated by a General Pareto distribution with a shape parameter of 0.78 and a scale parameter of 30.76. The R² value is 0.9.
Short session times imply that network administrators using DHCP for IP address leasing can configure DHCP to provide short-term leases, after which IP addresses can be reclaimed or renewed.

Sessions can be broadly categorized by their bandwidth consumption into light, medium, and heavy sessions: light sessions on average generate traffic at 15 Kbps, medium sessions between 15 and 80 Kbps, and heavy sessions above 80 Kbps. The highest instantaneous bandwidth demand is 590 Kbps.

Web traffic accounts for 46% of the total bandwidth of all application traffic, and 57% of all flows. Web and SSH together account for 64% of the total bandwidth and 58% of flows.

There is an implicit correlation between session duration and average data rates. Longer sessions typically have very low data requirements. Most of the sessions with a high average data rate are very short (< 15 minutes).

Their analysis of user mobility shows that users are mobile when expected, i.e., at the beginning and end of the conference sessions. About 75% of the users are seen at more than one AP during the day.

Their analysis of network performance shows that the load distribution across APs is highly uneven and does not directly correlate with the number of users at an AP. Stated another way, the peak offered load at an AP is not reached when the number of associated users is at its maximum. Rather, the load at an AP is determined more by individual user workload behavior. One implication of this result is that load balancing solely by the number of associated users may perform poorly.

Their observations indicate that the traditional method of modeling user arrivals as a Poisson arrival process may not adequately characterize scenarios where arrivals are correlated in time and space. However, although the MMPP model is well suited to their conference setting, where most users follow a common schedule, they do not expect it to generalize to every public-area wireless network.
For example, it may be appropriate in an airport network, where users cluster at gates at specific times in anticipation of departures, but not in a shopping mall network, where we would expect user arrivals, departures, and mobility to be more random.

3. Modeling users’ mobility among Wi-Fi access points. (Minkyong Kim, David Kotz)

In this paper, the authors present a model of user movements between APs. From the syslog messages collected on the Dartmouth campus, they count the number of visits to each AP. Based on the observation that most APs have strong daily repetition, they aggregate the multiple days of hourly visits into a single day. They then cluster APs based on their peak hour. They derive four clusters with different peak times and one cluster consisting of stable APs whose number of visits does not change much over 24 hours. To model a cluster, they compute hourly arrival and departure rates and the distribution of daily arrivals. They leave the evaluation of this model as future work.

Their experience with the Dartmouth traces showed that different APs have their peak number of users at different times of the day. After setting aside the stable APs, they clustered the rest of the APs based on their peak hour.

They used a filter to convert the syslog traces into the sequence of APs that each client associates with. This filter also defines the OFF state, which represents being not connected to the network. A device enters the OFF state when it is turned off or when it loses network connectivity.

They concluded that all of the clusters, except Cluster 1, have more transitions from/to another AP than from/to the OFF state. The high number of transitions from/to another AP is partly due to the ping-pong effect: associating repeatedly with multiple APs. When a device is within the range of multiple APs, it often changes its associated AP. Thus, changes in association do not necessarily mean that the user moved physically.
The ping-pong effect is especially common where the density of APs is high.

In the process of developing the model, they found that the number of visits to APs exhibits a strong daily pattern. They also found that clustering APs based on their peak time is effective.

4. Characterizing Flows in Large Wireless Data Networks. (Xiaoqiao (George) Meng, Starsky H.Y. Wong, Yuan Yuan, Songwu Lu)

In this paper, the authors statistically characterize both static flows and roaming flows in a large campus wireless network using a recently collected trace. If only one AP is found to be used by a flow, they categorize it as a static flow; otherwise, it is a roaming flow. They explain the modeling results from the perspective of user behavior and application demands. For example, they attribute the Weibull regression model both to the strong 24-hour periodicity and diurnal cycle observed in user activity and to the coexistence of applications with short and long inter-arrival times. They use two examples, scheduling and wireless TCP, to showcase how to apply their results to evaluate the dynamic behavior of network protocols. Most studies in the wireless literature are based on unrealistic settings of static flow configuration or a simplistic Poisson model. They show that such simulation and analysis models can produce misleading results compared with models derived from real traces.

They found that the inter-arrival times can be well modeled by a Weibull distribution at fine time scales, e.g., on an hourly basis. This result holds for all 24 hourly intervals. In the study, they also tested five other continuous distribution models: Exponential, Lognormal, Gamma, Pareto and Extreme-value. None of them produces a consistently good match. Their further analysis shows that if the granularity is reduced to finer scales, say, half an hour, the Weibull model still matches well.
However, if they increase the granularity to coarser scales, say, two hours, simple distributions such as the Weibull model are generally not sufficient, because the nonstationarity makes the traffic much more variable. Therefore, they select the hourly scale for modeling. The Weibull regression model accurately approximates the flow arrival process at all time scales. They further discover that the parameters of the Weibull regression model are location dependent and vary from one AP to another. However, APs in the same subnet exhibit spatial similarity, in the sense that the flow inter-arrival times across these APs are highly likely to follow an identical statistical distribution.

As for the flow duration, they characterize it via the data size each flow transfers, and find that it follows the Lognormal distribution. This distribution holds for all APs.

5. Measurement and Analysis of the Error Characteristics of an In-Building Wireless Network. In Proceedings of ACM SIGCOMM’96, pages 243–254, August 1996. (D. Eckardt and P. Steenkiste)

The authors examined their campus-wide WaveLAN installation and focused more on network performance and less on user behavior. The focus of their study was on the error model and signal characteristics of the RF environment in the presence of obstacles.

6. Wireless Andrew: Experience building a high speed, Campus-Wide Wireless Data Network. In Proceedings of ACM MobiCom’97, pages 55–65, August 1997. (B. J. Bennington and C. R. Bartel)

The focus of their study was on the installation and maintenance issues of a campus wireless network and on comparing its performance to a wired LAN.

7. Analysis of a Local-Area Wireless Network. In Proceedings of ACM MobiCom’00, pages 1–10, August 2000. (D. Tang and M. Baker)

The focus of their study was on the user behavior and traffic characteristics in a university department network.
They analyzed a 12-week trace collected from the wireless network used by the Stanford Computer Science department; this study built on earlier work involving fewer users and a shorter duration. Their study provides a good qualitative description of how mobile users take advantage of a wireless network, although it does not give a characterization of user workloads in the network. The Stanford study looked at a network of large geographic size where the users are unevenly distributed across the APs in the building.

8. Analysis of a Metropolitan-Area Wireless Network. In Proceedings of ACM MobiCom’99, pages 13–23, August 1999. (D. Tang and M. Baker)

In this earlier study, Tang and Baker characterized user behavior in a low-bandwidth metropolitan area network, focusing mainly on user mobility. The network was spread over a larger geographical area and had very different performance characteristics.

9. Trace-based Mobile Network Emulation. In Proceedings of ACM SIGCOMM’97, pages 51–61, September 1997. (B. Noble, M. Satyanarayanan, G. Nguyen, and R. Katz)

This work is a joint research effort between CMU and Berkeley that proposed a novel method for network measurement and evaluation applicable to wireless networks. The technique, called trace modulation, involves recording known workloads at a mobile host and using them as input to develop a model of network behavior. Although this work helps in developing a good model of network behavior, it does not provide a realistic characterization of user activity in a mobile setting.

10. Characterizing Usage of a Campus-wide Wireless Network. Technical Report TR2002-423, Dartmouth College, March 2002. (D. Kotz and K. Essien)

The focus of their study was on the user behavior and traffic characteristics in a college campus. Kotz and Essien traced and characterized the Dartmouth College campus-wide wireless network during their fall 2001 term.
Their workload is quite extensive, both in scope (1706 users across 476 access points) and duration (12 weeks). Kotz and Essien focus on large-scale characteristics of the campus, such as the overall application mix, overall traffic per building and AP, mobility patterns, etc. In terms of application mix, their network carries a rich set of applications that reflects the nature of campus-wide applications. Given the size of their network, they were able to study mobility patterns as well. Interestingly, they found that most users were stationary within a session and overall associated with just a few APs during the term.

Data

We used a standard Linksys router as a sniffer that recorded the packets sent by users in the network over a period of 6 weeks and 4 days. Every packet record contains the MAC address of the Access Point, the MAC address of the user, the source and destination IP addresses, the size of the packet, and the time it was received.

Photo of the router:

Access Points Analysis

IEEE 802.11 Architecture

IEEE 802.11 is based on a cellular architecture in which the system is subdivided into cells, where each cell (called a Basic Service Set, or BSS, in the 802.11 nomenclature) is controlled by a Base Station (called an Access Point, or AP for short).

Even though a wireless LAN may be formed by a single cell with a single Access Point, most installations are formed by several cells, where the Access Points are connected through some kind of backbone (called the Distribution System, or DS), typically Ethernet and in some cases wireless itself.

The whole interconnected wireless LAN, including the different cells, their respective Access Points and the Distribution System, is seen by the upper layers of the OSI model as a single 802 network, and is called an Extended Service Set (ESS) in the standard.

The following picture shows a typical 802.11 LAN, with the components described previously:

(Brenner)

We examine the Wi-Fi network of the Electrical Engineering faculty, which has 16 Access Points.
Arrival Rate

To investigate how the arrival rate at the APs changes during the day, we chose an averaging window of 30 minutes. The following graphs show the average arrival rate in units of B/min with an averaging window of 30 minutes:

Figure 1 - Arrival rate of AP 1
Figure 2 - Arrival rate of AP 2
Figure 3 - Arrival rate of AP 3
Figure 4 - Arrival rate of AP 4
Figure 5 - Arrival rate of AP 5
Figure 6 - Arrival rate of AP 6
Figure 7 - Arrival rate of AP 7
Figure 8 - Arrival rate of AP 8
Figure 9 - Arrival rate of AP 9
Figure 10 - Arrival rate of AP 10
Figure 11 - Arrival rate of AP 11
Figure 12 - Arrival rate of AP 12
Figure 13 - Arrival rate of AP 13
Figure 14 - Arrival rate of AP 14
Figure 15 - Arrival rate of AP 15
Figure 16 - Arrival rate of AP 16

We see that some APs are active from midday till the evening, and some APs are active only in the evening or during a specific hour. Access Point 1 achieves a maximal average arrival rate of approximately 7 × 10^5 B/min, and Access Points 9, 10 and 16 achieve a maximal average arrival rate of approximately 0.14 B/min.

We chose AP 1 for analyzing the arrival rate with different averaging windows, and we produced graphs of AP 1 for each of them:

Averaging window of 0.1 hour:
Figure 17 - Arrival rate of AP 1 with averaging window of 0.1 hour

Averaging window of 0.2 hour:
Figure 18 - Arrival rate of AP 1 with averaging window of 0.2 hour

Averaging window of 0.25 hour:
Figure 19 - Arrival rate of AP 1 with averaging window of 0.25 hour

Averaging window of 0.3 hour:
Figure 20 - Arrival rate of AP 1 with averaging window of 0.3 hour

Averaging window of 0.35 hour:
Figure 21 - Arrival rate of AP 1 with averaging window of 0.35 hour

Averaging window of 0.4 hour:
Figure 22 - Arrival rate of AP 1 with averaging window of 0.4 hour

Averaging window of 0.5 hour:
Figure 23 - Arrival rate of AP 1 with averaging window of 0.5 hour

The following graph shows the difference between the average arrival rates for different averaging windows:

Figure 24 - Arrival rate of AP 1 with different averaging windows

We see that there is a tradeoff in the selection of the averaging window. Small averaging windows produce graphs with sharp peaks, to which it is hard to fit a function. Large averaging windows produce smooth graphs that are easy to fit, but they also lose detail in the data.

Users

We gathered statistics of 3273 users. We chose a few users to show their transmission rate during the day with an averaging window of 1 hour.

User 1:
Figure 25 - Transmission rate of user 1 during the day with averaging window of 1 hour

User 2:
Figure 26 - Transmission rate of user 2 during the day with averaging window of 1 hour

User 3:
Figure 27 - Transmission rate of user 3 during the day with averaging window of 1 hour

We see that these users are active from 9 am. Users 1 and 2 are active till 9 pm, and user 3 is active till 7 pm.
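The windowed rates plotted for the APs and users above can be reproduced along the following lines. This is only a minimal sketch, not the code used in the project: the function name `windowed_rate`, the `(timestamp, size)` input format, and the folding of all timestamps onto a single 24-hour day are assumptions made for illustration.

```python
import math


def windowed_rate(packets, window_hours, day_seconds=24 * 3600):
    """Return the byte rate (B/min) for each time-of-day window.

    packets: iterable of (timestamp_seconds, size_bytes) pairs (assumed format).
    Timestamps are folded onto one day; to get a per-day average over a
    multi-day trace, divide the result by the number of days.
    """
    window_s = window_hours * 3600.0
    n_bins = math.ceil(day_seconds / window_s)
    totals = [0.0] * n_bins
    for t, size in packets:
        # Index of the window that contains this packet's time of day.
        idx = int((t % day_seconds) // window_s)
        totals[idx] += size
    window_min = window_s / 60.0
    return [total / window_min for total in totals]


# Hypothetical packets: two small ones early on, one larger one after 30 minutes.
example = [(0, 600), (100, 600), (1800, 3000)]
half_hour = windowed_rate(example, 0.5)
one_hour = windowed_rate(example, 1.0)
```

With the 0.5 hour window the two bursts stay in separate bins, while the 1 hour window merges them into one smoother value, which is the tradeoff between sharp peaks and lost detail described above.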
We also show the average transmission rate of users during the week with an averaging window of 1 hour:

User 2:
Figure 28 - Transmission rate of user 2 during the week with averaging window of 1 hour

User 3:
Figure 29 - Transmission rate of user 3 during the week with averaging window of 1 hour

We see that these users are active only during the first part of the week.

Visit duration

We are interested in defining the users’ visit duration. When can we say that a visit has ended and that the next received packets represent the start of a new visit? The following figures show the inter-arrival times of the packets of one user on different days and at different times of the day:

Figure 30 - Inter-arrival times of packets from 9:30 am till 11:18 am

The following figure shows that the typical intervals between packet bursts are 24 min, 55 min, and 2 hours. We see that the user is active during the breaks and not active during the lectures, which last 50-55 minutes.

Figure 31 - Inter-arrival times of packets from 12:30 pm till 4:53 pm

The following figure shows that the typical intervals between packet bursts are 10 min, 25 min, 50 min, 66 min, and 4 hours:

Figure 32 - Inter-arrival times of packets during the day

We want to establish the maximal interval between two packets that can still be considered part of one visit. To do so we use histograms of the packets’ inter-arrival times:

Figure 33 - Histogram of user's packets inter-arrival times
Figure 34 - Histogram of user's packets inter-arrival times
Figure 35 - Histogram of user's packets inter-arrival times

We also chose some users with a significant number of packets and produced their histograms:

Figure 36 - Histogram of user's packets inter-arrival times
Figure 37 - Histogram of user's packets inter-arrival times

We chose 30 minutes as the maximal inter-arrival time between two packets that can be considered packets of one visit.
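The 30-minute rule can be sketched as follows. This is an illustrative sketch rather than the project's code: the function names `split_into_visits` and `visit_durations` and the representation of a user's packets as a list of timestamps in seconds are assumptions.

```python
def split_into_visits(timestamps, max_gap_s=30 * 60):
    """Group one user's packet timestamps (seconds) into visits.

    A packet joins the current visit when it arrives at most max_gap_s
    after the previous packet; a longer gap starts a new visit.
    """
    visits = []
    for t in sorted(timestamps):
        if visits and t - visits[-1][-1] <= max_gap_s:
            visits[-1].append(t)
        else:
            visits.append([t])
    return visits


def visit_durations(visits):
    """Duration of each visit: time between its first and last packet."""
    return [visit[-1] - visit[0] for visit in visits]


# Hypothetical timestamps: the last gap is 1801 s, just over 30 minutes.
stamps = [0, 60, 600, 2400, 4201]
visits = split_into_visits(stamps)
durations = visit_durations(visits)
```

In this example the gap of exactly 1800 seconds keeps the fourth packet in the first visit, while the gap of 1801 seconds opens a second visit containing a single packet.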
If the inter-arrival time is more than 30 minutes, we treat the next packet as the start of a new visit.

Features

We chose to base the clustering of the users on the following features: average visit duration, average inter-arrival time between visits, average traffic, number of visits and total number of days in the system. The following graphs show the correlation between those features:

Figure 38 - Average inter-visits times vs. Average visit duration
Figure 39 - Average inter-visits times vs. Number of visits
Figure 40 - Average traffic vs. Average visit duration
Figure 41 - Average traffic vs. Number of visits
Figure 42 - Average visit duration vs. Number of visits
Figure 43 - Average Inter visits times vs. Total days in system

According to the above figures, we do not see any typical clusters among the network’s users.

We based the clustering on average characteristics, which is why we want to examine the standard deviation of the characteristics used in the clustering. We expect the characteristics to have a small standard deviation, so that a clustering algorithm that uses average characteristics can give proper results.

The standard deviation of inter-arrival times of visits:
Figure 44 - Standard deviation of inter-arrival times of visits

The standard deviation of traffic per packets:
Figure 45 - Standard deviation of traffic per packets

The standard deviation of visit duration times:
Figure 46 - Standard deviation of visit duration times

As we see, none of the characteristics has a significant standard deviation.

Clustering

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters. We can show this with a simple graphical example:

Possible Applications:

• Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
• Biology: classification of plants and animals given their features;
• Libraries: book ordering;
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
• City-planning: identifying groups of houses according to their house type, value and geographical location;
• Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
• WWW: document classification; clustering weblog data to discover groups of similar access patterns.

(A Tutorial on Clustering Algorithms)

Distance Measure

An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units, then the simple Euclidean distance metric may be sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. The figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scalings can lead to different clusterings.

Notice however that this is not only a graphic issue: the problem arises from the mathematical formula used to combine the distances between the single components of the data feature vectors into a unique distance measure that can be used for clustering purposes: different formulas lead to different clusterings.
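The effect of relative scaling can be reproduced with a small numeric example. The sketch below uses hypothetical (width, height) values and helper names chosen for illustration; it only shows that rescaling one component can change which point is nearest under the Euclidean distance.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, points):
    """Index of the point closest to the query."""
    return min(range(len(points)), key=lambda i: euclidean(query, points[i]))


def rescale(point, factors):
    """Multiply every component by its scaling factor."""
    return tuple(x * f for x, f in zip(point, factors))


# Hypothetical (width, height) measurements, both in the same unit.
points = [(1.0, 5.0), (4.0, 1.0)]
query = (1.0, 1.0)

# Original scaling: distances are 4.0 and 3.0, so point 1 is nearest.
before = nearest(query, points)

# Halve the weight of the height component: distances become 2.0 and 3.0.
factors = (1.0, 0.5)
after = nearest(rescale(query, factors), [rescale(p, factors) for p in points])

print(before, after)
```

The data are identical in both cases; only the weighting of the height component changes, and with it the nearest neighbour of the query point.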
Again, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application. (A Tutorial on Clustering Algorithms)

K-Means Clustering

In the first step of the clustering we use the k-means clustering algorithm and analyze its results.

In data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum.

Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS):

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where $\mu_i$ is the mean of the points in $S_i$.

We used the MATLAB function kmeans(X,k), which partitions the points in the n-by-p data matrix X into k clusters. kmeans returns an n-by-1 vector containing the cluster index of each point. By default, kmeans uses squared Euclidean distances.

K-Means Results

For clustering we use the following features: average visit duration, average inter-visit times, average traffic per packets, maximal distance between visits, and minimal distance between visits.

The following figures show the results of the k-means algorithm when k = 2:

Figure 47 - k-means results when k=2: Average visit duration vs. Average inter-visits times
Figure 48 - k-means results when k=2: Average visit duration vs. Average inter-visits times vs. Average traffic per packets
Figure 49 - k-means results when k=2: Average inter-visits times vs. Average traffic per packets
Figure 50 - k-means results when k=2: Maximal distance between visits vs. Minimal distance between visits

The following figures show the results of the k-means algorithm when k = 3:

Figure 51 - k-means results when k=3: Average visit duration vs. Average inter-visits times
Figure 52 - k-means results when k=3: Average inter-visits times vs. Average traffic per packets
Figure 53 - k-means results when k=3: Maximal distance between visits vs. Minimal distance between visits

The following figures show the results of the k-means algorithm when k = 4:

Figure 54 - k-means results when k=4: Average visit duration vs. Average inter-visits times
Figure 55 - k-means results when k=4: Average inter-visits times vs. Average traffic per packets
Figure 56 - k-means results when k=4: Maximal distance between visits vs. Minimal distance between visits

We can’t easily identify typical clusters as, for example, we can in the following figure:

As we see, we did not find any isolated clusters. One problem in clustering the users is that the decision is based on the Euclidean distance, and the distance measure between data points is an important component of a clustering algorithm. The components of our data instance vectors are not in the same physical units, which is why we did not succeed in grouping similar data instances.

G-Means Algorithm

We decided to use each visit as a point instead of clustering on the average characteristics. Each point consists of the following components: the visit duration, the time between the visit and the previous visit, the number of packets sent during the visit and the average amount of data accessed during the visit. We normalize the data components to get proper results even with the simple Euclidean distance metric.

We used the g-means clustering algorithm to find an optimal number of clusters. When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem.
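As an illustration of the two building blocks mentioned so far, the sketch below normalizes per-visit features and then runs plain k-means (Lloyd's iterations) with a fixed k. It is not the MATLAB code used in the project: the z-score normalization is an assumed choice (the report does not say which normalization was applied), and the function names and the six sample visits are hypothetical.

```python
import random
import statistics


def zscore_columns(rows):
    """Scale every feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100, seed=0):
    """Plain Lloyd iterations; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [statistics.mean(col) for col in zip(*members)]
    return labels, centers


# Hypothetical visits: (duration [min], time since previous visit [min], packets).
visits = [
    (5, 60, 40), (7, 55, 50), (6, 65, 45),
    (90, 600, 4000), (95, 620, 4200), (85, 580, 3900),
]
labels, centers = kmeans(zscore_columns(visits), k=2)
print(labels)
```

Without the normalization step the packet count, which is numerically the largest component, would dominate the squared Euclidean distance; after normalization all components contribute on a comparable scale.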
An improved algorithm that learns k while clustering is called G-means. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. (Hamerly & Elkan)

The statistical test detects whether the data assigned to a center are sampled from a Gaussian by accepting one of the two following hypotheses:

• $H_0$: The data around the center are sampled from a Gaussian.
• $H_1$: The data around the center are not sampled from a Gaussian.

If the null hypothesis $H_0$ is accepted, then the one center is sufficient to model its data, and the cluster should not be split into two sub-clusters. If $H_0$ is rejected, then the cluster will be split.

The test used in the g-means algorithm is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF).

We must choose the significance level of the test, α, which is the desired probability of incorrectly rejecting $H_0$. It is appropriate to use a Bonferroni adjustment to reduce the chance of incorrectly rejecting $H_0$ over multiple tests. For example, if we want a 0.01 chance of incorrectly rejecting $H_0$ in 100 tests, we should apply a Bonferroni adjustment to make each test use α = 0.01/100 = 0.0001. To find k final centers the G-means algorithm makes k statistical tests, so the Bonferroni correction does not need to be extreme.
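The split decision and the Bonferroni adjustment can be sketched as follows. This is an illustrative sketch under stated assumptions, not the g-means code used in the project. It assumes the data of one cluster have already been reduced to one dimension (in Hamerly & Elkan's description this is done by projecting onto the direction connecting the two candidate child centers). The small-sample correction factor is the one given by Hamerly & Elkan for the case of estimated mean and variance. The critical value 1.8692 is an assumed value for α = 0.0001 and should be checked against a statistical table before being relied upon. All function names are hypothetical.

```python
import math
import random
import statistics


def bonferroni(alpha_total, n_tests):
    """Per-test significance level for a desired overall level."""
    return alpha_total / n_tests


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def anderson_darling_normal(sample):
    """Corrected Anderson-Darling statistic for a 1-D sample.

    Mean and variance are estimated from the sample itself.
    """
    n = len(sample)
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)
    z = sorted((x - mu) / sigma for x in sample)
    eps = 1e-12  # keep the logarithms finite for extreme values
    f = [min(max(normal_cdf(v), eps), 1.0 - eps) for v in z]
    s = sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    # Correction for estimated mean and variance (Hamerly & Elkan).
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)


def should_split(sample, critical_value):
    """Reject H0 (Gaussian) and split when the statistic is too large."""
    return anderson_darling_normal(sample) > critical_value


CRITICAL = 1.8692  # assumed critical value for alpha = 0.0001

rng = random.Random(1)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(200)]
bimodal = [rng.gauss(-5.0, 1.0) for _ in range(100)] + \
          [rng.gauss(5.0, 1.0) for _ in range(100)]

a2_gauss = anderson_darling_normal(gaussian)
a2_bimodal = anderson_darling_normal(bimodal)
print(bonferroni(0.01, 100), a2_gauss, a2_bimodal)
```

A sample drawn from a single Gaussian yields a small statistic and the cluster is kept, while a sample made of two well-separated groups yields a much larger statistic, so the cluster would be split into two.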
We used an implementation of the g-means algorithm that finds the optimal number of clusters and then runs the k-means algorithm with that number of clusters.

The following figure shows the dependence of the number of clusters on α:

Figure 57 - Number of clusters produced by the g-means algorithm vs. Significance level α

As was said before, α is the desired probability of incorrectly rejecting $H_0$. So the bigger the significance level α, the bigger the probability of incorrectly rejecting $H_0$ and the bigger the probability of splitting a cluster into two sub-clusters. That is why the bigger the α we use, the more clusters are produced by the g-means algorithm.

For the g-means algorithm we used α = 0.0001. We present the results of the algorithm, which produced 70 clusters.

Histograms of clusters for some users:

[Histogram titled "user 136 labels"]
Figure 58 - User's Clusters produced by the g-means algorithm

User 136 had 58 visits and sent 8788 packets. This user was associated with 30 clusters; the most common clusters for this user are clusters 11, 20, 29 and 35. Each of the most common clusters contains 4 samples.

[Histogram titled "user 202 labels"]
Figure 59 - User's Clusters produced by the g-means algorithm

User 202 had 59 visits and sent 28777 packets. This user was associated with 31 clusters; the most common cluster for this user is cluster 30, which contains 6 samples.

[Histogram titled "user 240 labels"]
Figure 60 - User's Clusters produced by the g-means algorithm

User 240 had 55 visits and sent 23801 packets. This user was associated with 36 clusters; the most common clusters for this user are clusters 20, 22, 30, 59 and 60. Each of the most common clusters contains 3 samples.

According to the histograms, it is hard to represent a user by one typical cluster.

Evaluation of clustering

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity and low inter-cluster similarity.
This is an internal criterion for the quality of a clustering. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest. For search result clustering, we may want to measure the time it takes users to find an answer with different clustering algorithms. This is the most direct evaluation, but it is expensive, especially if large user studies are necessary. (Manning, Raghavan, & Schütze)

We will use an external criterion of clustering quality called purity. Purity is a simple and transparent evaluation measure. To compute purity, each cluster is assigned to the class which is most frequent in the cluster. Formally:

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \lvert \omega_k \cap c_j \rvert$$

where $\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters, $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes, and N is the total number of clustered points.

For example, if the following figure represents the result of the clustering algorithm:

(Manning, Raghavan, & Schütze)

then the majority class and number of members of the majority class for the three clusters are: x, 5 (cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3). Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.

The following figure shows the dependence of the purity on α:

[Plot titled "alpha vs purity"; x-axis: alpha, y-axis: purity]
Figure 61 - Evaluation measure Purity vs. significance level α

The bigger the α, the more clusters are produced by the algorithm. It is logical that the purity increases with the number of clusters.

We also present a new evaluation measure E that is calculated in the following fashion. For each user we determine the most common cluster and the number of samples contained in this cluster; then we divide the number of samples contained in the most common cluster by the user’s total number of samples.
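Purity and the per-user ratio just described can be sketched in a few lines. The sketch is illustrative and is not the code used in the project: the function names are hypothetical, and in the first example only the majority counts (5, 4 and 3 out of 17 points) come from the worked purity example, while the minority members of each cluster are assumed. Averaging the per-user ratios over all users gives the measure E.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, class_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster[cluster][cls] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(cluster_labels)


def typical_cluster_score(cluster_labels, users):
    """Average over users of (samples in most common cluster / all samples)."""
    by_user = defaultdict(Counter)
    for cluster, user in zip(cluster_labels, users):
        by_user[user][cluster] += 1
    ratios = [max(c.values()) / sum(c.values()) for c in by_user.values()]
    return sum(ratios) / len(ratios)


# Purity example: majority counts 5, 4 and 3 out of 17 points.
# The minority members ("o" and "d" below) are assumed for illustration.
clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (["x"] * 5 + ["o"]
           + ["o"] * 4 + ["x", "d"]
           + ["d"] * 3 + ["x"] * 2)
p = purity(clusters, classes)

# E example with two hypothetical users:
# user "a": 2 of 3 visits in cluster 1; user "b": 4 of 4 visits in cluster 3.
visit_clusters = [1, 1, 2, 3, 3, 3, 3]
visit_users = ["a", "a", "a", "b", "b", "b", "b"]
e = typical_cluster_score(visit_clusters, visit_users)

print(round(p, 2), round(e, 2))
```

A value of E close to 1 would mean that most visits of each user fall into a single cluster; a value close to 0 means that each user's visits are scattered over many clusters.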
We find the average of these values:

$$E = \frac{1}{N} \sum_{i=1}^{N} \frac{x_i}{M_i}$$

where N is the total number of users, $x_i$ is the number of samples contained in the most common cluster of user i, and $M_i$ is the total number of samples of user i.

We would like to examine whether most of a user’s visits are clustered into one cluster, so that the user is represented by one typical cluster. The measure E represents how well each user can be represented by one typical cluster.

The following figure shows the dependence of the evaluation measure E on α:

[Plot; x-axis: alpha]
Figure 62 - Evaluation measure E vs. significance level α

The bigger the α, the more clusters are produced by the algorithm. It is logical that E decreases with the number of clusters. When the number of clusters is large, a user is associated with a large number of clusters and it is more difficult to represent the user by one typical cluster. We see that the upper bound of E is 0.13, which means that the g-means algorithm does not achieve good performance when clustering the network’s users.

Conclusions

The conclusions that were made in this work:

1. The Access Points’ arrival rate is consistent with the times of lectures and breaks. The APs show low activity during the lectures and high activity during the breaks.
2. The k-means clustering algorithm based on average characteristics of the network’s users can’t produce any isolated clusters. That is why we conclude that this algorithm can’t cluster the network’s users well.
3. The g-means clustering algorithm based on points that consist of the 4 characteristics (described earlier) can’t represent each user by one typical cluster. That is why we conclude that this algorithm can’t cluster the network’s users well.

Bibliography

A Tutorial on Clustering Algorithms. (n.d.). Retrieved from http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/
Brenner, P. (n.d.). A technical tutorial on the IEEE 802.11 protocol.
Hamerly, G., & Elkan, C. (n.d.). Learning the k in k-means.
Manning, C. D., Raghavan, P., & Schütze, H. (n.d.). Introduction to Information Retrieval.