Mining Social Features from WLAN Traces: A Gender-Based Case Study
Udayan Kumar and Ahmed Helmy
Computer and Information Sciences and Engineering, University of Florida, Gainesville
{ukumar, helmy}@cise.ufl.edu
http://www.cise.ufl.edu/~helmy

Introduction
• Mobile services are becoming more ubiquitous and human-centric
  – Devices (e.g., mobile phones) act as sensors, and human society as the sensor network
• Challenge: to realistically model social behavior for the simulation, evaluation and design of future mobile networks
• Approach: capture and analyze extensive network traces
  – How much information can be extracted from the traces?
• As a first step, we present systematic methods to classify WLAN users into social clusters, based on WLAN usage traces, using features like gender and study major

Outline
1. Traces Used
2. Classification Methods
   I. Location Based Classification
      A. Individual Behavior Based Filter
      B. Group Behavior Based Filter
      C. Hybrid Filter
      D. Validation of Location Based Method
   II. Name Based Classification
      A. Validation of Location Based Method via Name Based Classification
3. User Behavior Analysis
   I. Spatial Distribution
   II. Temporal Distribution
   III. Device Preferences
4. Applications
5. Conclusion and Future Work

1. Traces Analyzed
• We consider WLAN association traces from two university campuses, U1 and U2 (names omitted for privacy)

University   Time Period             Users   Access Points
U1           Feb 2006 to Feb 2007    ~20K    150
U2           Nov 2007 to Apr 2008    ~30K    700

• Traces provide the following information
  – MAC address, association start time, duration, location/AP names
• Traces from U2 also provide usernames.

1. Traces Analysis
• A trace does not provide personal information about the users, such as gender or major.
• How can we use the trace information to classify users into groups based on gender or study major?
• How much information can we get from these published data sets?

2-I.
Location Based Classification (LBC)*
• US university campuses have fraternities and sororities. Fraternities house males and sororities house females. (Other campuses may have separate male and female housing, and urban areas may have places that are gender biased.)
• A user's association with a fraternity AP suggests that the user is male (and, likewise, association with a sorority AP suggests that the user is female)
• But what about visitors? We need filtering!
* Results shown use traces from U1

2-I. Filtering
• We use individual behavior based and group behavior based filtering.
A. The Individual Behavior Filter (IBF) considers the fraction of time a user associates with APs in a building with respect to the user's total associations.
B. The Group Behavior Filter (GBF) considers a user's associations with APs in a building with respect to all the other users associating with the same APs.

Individual Behavior based Filtering (IBF)
• We use two metrics, based on:
  – Counts
  – Duration

  $\mathrm{PCM}(u) = \dfrac{C_f(u)}{C_f(u) + C_s(u)}$

  $\mathrm{PDM}(u) = \dfrac{D_f(u)}{D_f(u) + D_s(u)}$

  $C_f(u)$: count of sessions in fraternities by user u
  $C_s(u)$: count of sessions in sororities by user u
  $D_f(u)$: duration of sessions in fraternities by user u
  $D_s(u)$: duration of sessions in sororities by user u
• Consider all users visiting fraternities and sororities.
• A sharp drop indicates the division between the two groups (users with PCM/PDM > 0.8 are considered regular users).
[Figure: Users visiting fraternity and/or sorority in decreasing order of their male probability (U1, Feb 2006), with the "Regular Users" region marked]

I-B. Group Behavior based Filtering (GBF)
• To filter using group behavior, we use clustering techniques. We use the PAM* (Partitioning Around Medoids) algorithm.
• PAM provides methods for measuring clustering quality.
* L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, March 1990.

I-B.
Group Behavior based Filtering
• We use 3 metrics for clustering: distinct days of login, number of sessions, and duration of sessions
• Using this clustering technique, we are able to distinguish between user clusters
[Figure: Clustering results for University U1 sororities (Feb 2006)]

I-C. Results and Hybrid Filter (HF)
Hybrid Filter: based on the intersection of the results from IBF and GBF. This gives us higher confidence in the results.

              U1 IBF                      U1 GBF                      U1 HF
              Feb 2006 Oct 2006 Feb 2007  Feb 2006 Oct 2006 Feb 2007  Feb 2006 Oct 2006 Feb 2007
Total Users   16416    22405    20302     16416    22405    20302     16416    22405    20302
Males         506      553      545       451      437      417       416      418      399
Females       513      570      509       441      456      410       435      453      406
Common        0        0        0         22       37       29        0        0        0

I-D. Validation of LBC
• To increase confidence in LBC, validation is needed.
• However, validation with ground truth is difficult (MAC addresses are anonymized, surveys may be incorrect, and universities do not provide data due to privacy issues).
• Instead, we devised 3 statistical techniques to validate LBC (the 3rd is presented after Name Based Classification).

I-D-a. Temporal Consistency
• Classification should remain consistent over a time period (adjacent months in the same semester)
• If our filtering increases the consistency or similarity, it is likely classifying correctly.

Month a    Month b        Before filtering   IBF      GBF      HF
Feb 2006   Mar-Apr 2006   72.3 %             87.7 %   92.7 %   92.4 %
Oct 2006   Nov 2006       66.8 %             80.9 %   87.6 %   88.3 %
Feb 2007   Mar-Apr 2007   70.3 %             81.9 %   92.3 %   90.4 %
(For users visiting sororities)

I-D-b. IBF vs GBF
• Both filtering techniques should capture the same set of users. Therefore, we compare their results.

Month      Gender   IBF   GBF   IBF ∩ GBF
Feb 2006   Male     506   451   416
           Female   513   441   435
Oct 2006   Male     553   437   418
           Female   570   451   454
Feb 2007   Male     545   417   399
           Female   529   410   406

• The comparison shows that more than 75% of the users are the same

2-II. Name-Based Classification (NBC)
• In this technique we augment traces with external data.
• Traces from U2 provide usernames.
• We combine usernames with the publicly available phone book directory maintained by U2 to obtain the names of the users (users have an opt-out option).
• Next, we match the names obtained above against the most common male and female names obtained from the US Government SSN office to determine the gender of each user.

2-II. Name-Based Classification (NBC)

                 Nov 2007   Apr 2008
Total Users      27068      29982
Males (NBC)      5245       5807
Females (NBC)    5955       6817

• Compared to NBC, LBC requires less information (usernames are not needed)
• NBC provides a way to validate LBC.
• The use of NBC is limited, as usernames are available in very few of the currently available traces.
• Once we check the correctness of LBC, it can become the primary method for classification.

II-A. Cross Validation of LBC
• We compared LBC with NBC using traces from U2
• The advantage is that NBC has a low classification error rate.
• Error in classification is calculated by

  $E_f = \dfrac{|F_L \cap M_N|}{|F_L|}$ (female classification)

  $E_m = \dfrac{|M_L \cap F_N|}{|M_L|}$ (male classification)

           Female Classification      Male Classification
Month      FL     FL ∩ MN   Ef        ML    ML ∩ FN   Em
Nov 2007   1280   74        0.058     334   25        0.074
Apr 2008   1690   123       0.072     349   29        0.083

FL is classification as Female by LBC
MN is classification as Male by NBC
ML is classification as Male by LBC
FN is classification as Female by NBC

3. User Behavior Analysis
• Traces allow us to track a user throughout the traces, so we can follow classified users across the campus and study their network usage behavior!
• We consider
  – User spatial distribution
  – Temporal analysis
  – Device preferences

3-I. Spatial Distribution
[Figures: U1, U2]
• Both universities show more females in Social Sciences and Sports buildings
• Both universities show more males in Economics and Engineering buildings
• Inconsistent trends are observed at Music buildings

3-II. Temporal Analysis (session durations)
[Figures: U2, U1]
• On average, males have longer session durations
• Over time, session durations are getting shorter (are users becoming more mobile?)

3-III.
Device Preference
• Using the MAC address, one can identify the manufacturer of a device.
• Our analysis at U1 shows (with statistical significance) that females prefer Apple computers over PCs.
• However, no such preference is shown at U2 for the general population.
• We also see that external adapter vendors like Enterasys, Linksys and D-Link show a decreasing trend in number of users. Most users are getting devices with built-in Wi-Fi.

3-III. Device Preference
Females prefer Apple computers over Intel-based ones!
[Figure: U1]

4. Applications
• Mobility Models
  – Incorporate the effects of building context, 'behavioral' aspects, load (session duration) and density, among others, on correlated collective/group behavior
• Protocol Design
  – The effects of group behavior can be incorporated in protocol design for mobile networks.
• Privacy
  – The gender of users could be inferred/extracted from anonymized traces!

5. Conclusion & Future Work
• We introduce new methods to classify users into social groups based on features like gender and study major, among others.
• We applied our methods to traces collected from two different university campuses.
• The methods are able to distinguish major differences in group behaviors (mobility, vendor preference)
• Issues of privacy and anonymity arise when dealing with wireless network traces [UH09]
• This study opens doors for other mobile social networking studies and profile-based service designs based on sensing the human society.

Thank you!
Ahmed Helmy
helmy@ufl.edu
URL: www.cise.ufl.edu/~helmy

Appendix

Sororities and Fraternities

Number       U1   U2
Sorority     7    13
Fraternity   12   5

PAM
• PAM attempts to minimize dissimilarity within a cluster.
• It provides a technique called silhouette widths, together with a silhouette plot, to measure the quality of the clusters.
• The average silhouette width can be used to estimate the quality of the clustering: above 0.70 for strong clustering, between 0.50 and 0.70 for a reasonable structure, and below 0.50 for a weak structure
• All clusters we found were above 0.65 cluster quality.

3-III. Device Preference
No gender bias is noticed at U2
[Figure: U2]
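
Illustrative sketch (not part of the original slides)
The IBF metrics PCM and PDM and the cross-validation error defined in the slides can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Usage` record, the function names, the rule that both metrics must pass the 0.8 cut-off, and the symmetric rule for labeling females are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set

THRESHOLD = 0.8  # cut-off for "regular users" quoted on the IBF slide


@dataclass
class Usage:
    """Per-user totals (hypothetical record layout, not from the traces)."""
    frat_count: int        # C_f(u): sessions in fraternities
    sor_count: int         # C_s(u): sessions in sororities
    frat_duration: float   # D_f(u): time spent in fraternities
    sor_duration: float    # D_s(u): time spent in sororities


def share(part: float, other: float) -> Optional[float]:
    """Return part / (part + other), or None when the user has no activity."""
    total = part + other
    return part / total if total > 0 else None


def ibf_label(u: Usage, threshold: float = THRESHOLD) -> Optional[str]:
    """Label a user 'male', 'female', or None (visitor or undecided).

    Assumption: both the count metric (PCM) and the duration metric (PDM)
    must exceed the threshold; the slides do not say how they are combined.
    """
    pcm = share(u.frat_count, u.sor_count)
    pdm = share(u.frat_duration, u.sor_duration)
    if pcm is None or pdm is None:
        return None
    if pcm > threshold and pdm > threshold:
        return "male"
    # Assumed symmetric rule: a high sorority share suggests female.
    if (1 - pcm) > threshold and (1 - pdm) > threshold:
        return "female"
    return None


def error_rate(lbc: Set[str], nbc_opposite: Set[str]) -> float:
    """E = |LBC ∩ NBC_opposite| / |LBC|, e.g. E_f = |F_L ∩ M_N| / |F_L|."""
    if not lbc:
        raise ValueError("LBC set must not be empty")
    return len(lbc & nbc_opposite) / len(lbc)
```

For example, a user with 9 fraternity sessions and 1 sorority session, and a matching 9:1 split in duration, has PCM = PDM = 0.9 and would be labeled male under these assumptions. A user with an even split stays unlabeled, which is how visitors are filtered out.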