
Extract: Mining Social Features from WLAN Traces
A Gender-Based Case Study
Udayan Kumar and Ahmed Helmy
Computer and Information Sciences and Engineering,
University of Florida, Gainesville
{ukumar, helmy}@cise.ufl.edu
http://www.cise.ufl.edu/~helmy
1
Introduction
• Mobile services are becoming more ubiquitous and human centric
– Devices (e.g., mobile phones) act as sensors, and human society
acts as the sensor network
• Challenge: to realistically model social behavior for the
simulation, evaluation and design of future mobile networks
• Approach: capture and analyze extensive network traces
– How much information can be extracted from the traces?
• As a first step, we present systematic methods to classify
WLAN users into social clusters, based on WLAN usage traces,
using features like gender and study major
2
Outline
1. Traces Used
2. Classification Methods
   I. Location Based Classification
      A. Individual Behavior Based Filter
      B. Group Behavior Based Filter
      C. Hybrid Filter
      D. Validation of Location Based Method
   II. Name Based Classification
      A. Validation of Location Based Method via Name Based Classification
3. User Behavior Analysis
   I. Spatial Distribution
   II. Temporal Distribution
   III. Device Preferences
4. Applications
5. Conclusion and Future Work
3
1. Traces Analyzed
• Consider WLAN association traces from two
university campuses, U1 and U2 (names omitted for privacy)

University   Time Period            Users   Access Points
U1           Feb 2006 to Feb 2007   ~20K    150
U2           Nov 2007 to Apr 2008   ~30K    700

• Traces provide the following information: MAC address,
association start time, duration, and location/AP
names.
• Traces from U2 also provide usernames.
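As a rough illustration of the record layout described above, the sketch below models one association as a small Python data class. The field names, the comma-separated line format, and the ISO timestamp are assumptions made for this example; the slides do not say how the traces are stored.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Association:
    mac: str                        # (anonymized) MAC address of the device
    start: datetime                 # association start time
    duration_s: float               # session duration in seconds
    ap_name: str                    # location / access point name
    username: Optional[str] = None  # only the U2 traces carry usernames


def parse_line(line: str) -> Association:
    """Parse one hypothetical CSV line: mac,start,duration,ap[,username]."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) not in (4, 5):
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}")
    username = fields[4] if len(fields) == 5 and fields[4] else None
    return Association(
        mac=fields[0],
        start=datetime.fromisoformat(fields[1]),
        duration_s=float(fields[2]),
        ap_name=fields[3],
        username=username,
    )
```

A U1-style line has four fields, so `username` stays `None`; a U2-style line would carry a fifth field.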
4
1. Trace Analysis
• A trace does not provide personal information
such as gender or major about the users.
• How can we use the available trace information
to classify users into groups based on gender or
study major?
• How much information can we get from these
published data sets?
5
2-I. Location Based Classification (LBC)*
• US university campuses have fraternities and
sororities. Fraternities house males and sororities
house females. (Other campuses may have separate
male and female housing, and urban areas may have
places that are gender biased.)
• A user's association with a fraternity AP suggests
that the user is male (and likewise, association with
a sorority AP suggests that the user is female)
• But what about visitors? We need filtering!
* Results shown are using traces from U1
6
2-I. Filtering
• Individual behavior based and group behavior
based filtering.
A. Individual Behavior Filter (IBF) considers the
fraction of time a user associates with APs in a
building with respect to the user's total associations.
B. Group Behavior Filter (GBF) considers a user's
association with APs in a building with respect
to all the other users associating with the same APs.
7
I-A. Individual Behavior based Filtering (IBF)
• We use two metrics, one based on session counts (PCM)
and one based on session duration (PDM):

$$PCM(u) = \frac{C_f(u)}{C_f(u) + C_s(u)}$$

$$PDM(u) = \frac{D_f(u)}{D_f(u) + D_s(u)}$$

$C_f(u)$ is the count of sessions in fraternities by user u
$C_s(u)$ is the count of sessions in sororities by user u
$D_f(u)$ is the duration of sessions in fraternities by user u
$D_s(u)$ is the duration of sessions in sororities by user u
• Consider all users visiting fraternities and sororities.
• A sharp drop indicates the division between two
groups. (Users with PDM/PCM > 0.8 are considered
regular users.)
[Figure: Users visiting fraternity and/or sorority in
decreasing order of their male probability (U1, Feb 2006),
with the "Regular Users" region marked]
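The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the input format (tuples of user, house type, duration), the requirement that both PCM and PDM exceed the 0.8 threshold, and the mirrored rule for females (1 - PCM and 1 - PDM above the threshold) are all assumptions made for the example.

```python
from collections import defaultdict


def male_probabilities(sessions):
    """Return {user: (PCM, PDM)}.

    sessions: iterable of (user, house_type, duration_seconds), where
    house_type is "fraternity" or "sorority" (hypothetical input format).
    """
    counts = defaultdict(lambda: {"fraternity": 0, "sorority": 0})
    durations = defaultdict(lambda: {"fraternity": 0.0, "sorority": 0.0})
    for user, house_type, duration in sessions:
        counts[user][house_type] += 1
        durations[user][house_type] += duration

    result = {}
    for user in counts:
        c_f, c_s = counts[user]["fraternity"], counts[user]["sorority"]
        d_f, d_s = durations[user]["fraternity"], durations[user]["sorority"]
        if d_f + d_s == 0:
            continue  # no usable duration information for this user
        result[user] = (c_f / (c_f + c_s), d_f / (d_f + d_s))
    return result


def classify_ibf(sessions, threshold=0.8):
    """Label regular users; users between the thresholds stay unlabeled."""
    labels = {}
    for user, (pcm, pdm) in male_probabilities(sessions).items():
        if pcm > threshold and pdm > threshold:
            labels[user] = "male"
        elif (1 - pcm) > threshold and (1 - pdm) > threshold:
            labels[user] = "female"
    return labels
```

A user who splits sessions evenly between fraternities and sororities gets PCM near 0.5 and is left unlabeled, which is the intended effect of the filter on visitors.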
I-B. Group Behavior based Filtering (GBF)
• To filter using group behavior, we use
clustering techniques, specifically the PAM*
(Partitioning Around Medoids) algorithm.
• PAM provides methods for measuring
clustering quality.
* L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster
Analysis. Wiley-Interscience, March 1990.
9
I-B. Group Behavior based Filtering
• We use 3 metrics for
clustering: distinct days
of login, number of
sessions, and duration of
sessions
• Using this clustering
technique we are able to
distinguish between user
clusters
Clustering results for University U1 Sororities
(feb2006)
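To make the clustering step concrete, here is a small k-medoids sketch over the three metrics. It is a simplification and not the full PAM procedure from Kaufman and Rousseeuw: it starts from the first k points instead of PAM's BUILD phase and then greedily accepts any medoid swap that lowers the total distance. The min-max scaling, the Euclidean distance, and the sample feature values are assumptions made for the example.

```python
import math


def min_max_scale(rows):
    """Scale each feature column to [0, 1] so no metric dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [tuple((v - lo) / sp for v, lo, sp in zip(row, lows, spans))
            for row in rows]


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, k):
    """Greedy swap-based k-medoids; returns (medoid indices, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    medoids = list(range(k))  # naive initialization (assumption)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:  # accept strictly better swaps
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: math.dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels


# Made-up users: (distinct login days, number of sessions, total hours)
features = [(25, 200, 500), (28, 220, 550), (26, 210, 520),
            (2, 5, 3), (1, 3, 2), (3, 6, 4)]
medoids, labels = k_medoids(min_max_scale(features), k=2)
```

With this toy data the first three users (heavy, regular usage) and the last three (occasional visitors) end up in different clusters.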
10
I-C. Results and Hybrid Filter (HF)
Hybrid Filter: based on the intersection of the results
from IBF and GBF. This gives us better
confidence in the results.

Filter   Month      Total Users   Males   Females   Common
U1 IBF   Feb 2006   16416         506     513       0
U1 IBF   Oct 2006   22405         553     570       0
U1 IBF   Feb 2007   20302         545     509       0
U1 GBF   Feb 2006   16416         451     441       22
U1 GBF   Oct 2006   22405         437     456       37
U1 GBF   Feb 2007   20302         417     410       29
U1 HF    Feb 2006   16416         416     435       0
U1 HF    Oct 2006   22405         418     453       0
U1 HF    Feb 2007   20302         399     406       0
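The hybrid filter is a set intersection, which can be sketched as below. The dictionary layout and the user IDs are hypothetical, and reading the "Common" row as users who appear in both the male and the female set is an assumption; the slides do not define it.

```python
def hybrid_filter(ibf, gbf):
    """Keep only users that both filters assign to the same gender.

    ibf, gbf: dicts with "male" and "female" sets of user ids.
    """
    return {gender: ibf[gender] & gbf[gender] for gender in ("male", "female")}


def common_users(result):
    """Users labeled both male and female by one filter (assumed meaning)."""
    return result["male"] & result["female"]


# Made-up example: GBF places u5 in both groups, IBF does not.
ibf = {"male": {"u1", "u2", "u3"}, "female": {"u4", "u5"}}
gbf = {"male": {"u1", "u2", "u5"}, "female": {"u4", "u5", "u6"}}
hf = hybrid_filter(ibf, gbf)
```

Because IBF never places a user in both groups in this example, the intersection cannot either, even though GBF does.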
11
I-D. Validation of LBC
• To increase confidence in LBC, validation is
needed.
• However, validation with ground truth is
difficult. (MAC addresses are anonymized, surveys may be incorrect, and universities don't provide
data due to privacy issues.)
• Instead, we devised 3 statistical techniques to
validate LBC. (The 3rd one is presented after Name Based Classification.)
12
I-D-a. Temporal Consistency
• Classification should remain consistent over a
time period (adjacent months in the same semester)
• If our filtering increases the consistency or
similarity, it is likely classifying correctly.

For users visiting sororities:

Month a    Month b        Before filtering   IBF      GBF      HF
Feb 2006   Mar-Apr 2006   72.3 %             87.7 %   92.7 %   92.4 %
Oct 2006   Nov 2006       66.8 %             80.9 %   87.6 %   88.3 %
Feb 2007   Mar-Apr 2007   70.3 %             81.9 %   92.3 %   90.4 %
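The slides do not define how consistency between two months is measured. One plausible reading, assumed here purely for illustration, is the fraction of users selected in month a who are selected again in month b:

```python
def consistency(month_a_users, month_b_users):
    """Fraction of month-a users that also appear in month b.

    This overlap measure is an assumption; the slides do not give the
    exact similarity formula.
    """
    a, b = set(month_a_users), set(month_b_users)
    if not a:
        raise ValueError("month a has no users")
    return len(a & b) / len(a)


# Made-up user ids for two adjacent months.
score = consistency({"u1", "u2", "u3", "u4"}, {"u2", "u3", "u4", "u9"})
```

Under this reading, a filter that removes occasional visitors should raise the score, since visitors are less likely than residents to reappear in the following month.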
13
I-D-b. IBF vs GBF
• Both filtering techniques should capture the same
set of users. Therefore, we compare the results.

Month      Gender   IBF   GBF   IBF ∩ GBF
Feb 2006   Male     506   451   416
Feb 2006   Female   513   441   435
Oct 2006   Male     553   437   418
Oct 2006   Female   570   451   454
Feb 2007   Male     545   417   399
Feb 2007   Female   529   410   406

• The comparison shows more than 75% of the
users are the same
14
2-II. Name-Based Classification (NBC)
• In this technique we augment traces with
external data.
• Traces from U2 provide usernames.
• We combine usernames with the publicly available
phone book directory maintained by U2 to obtain
the names of the users. (Users have an opt-out option.)
• Next, we match these names against the lists of most
common male and female names obtained from the
US Government SSN office to determine the
gender of each user.
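The steps above can be sketched as a lookup against two name lists. Everything concrete here is an assumption for the example: the directory is a plain dict from username to full name, the first token of the full name is taken as the given name, names that appear on both lists are treated as ambiguous and skipped, and the tiny name lists are hand-made stand-ins rather than the actual published lists.

```python
def build_gender_lookup(male_names, female_names):
    """Map lowercase given names to a gender, dropping ambiguous names."""
    male = {n.lower() for n in male_names}
    female = {n.lower() for n in female_names}
    lookup = {n: "male" for n in male - female}
    lookup.update({n: "female" for n in female - male})
    return lookup


def classify_users(usernames, directory, lookup):
    """Return {username: gender} for users whose given name is in lookup."""
    labels = {}
    for username in usernames:
        full_name = directory.get(username)
        if not full_name:
            continue  # user opted out of the directory or is not listed
        parts = full_name.split()
        if not parts:
            continue
        given = parts[0].lower()  # assumes "Given Family" name order
        if given in lookup:
            labels[username] = lookup[given]
    return labels


# Hand-made sample data (hypothetical usernames and directory entries).
lookup = build_gender_lookup(["James", "John", "Jordan"],
                             ["Mary", "Linda", "Jordan"])
directory = {"jsmith": "John Smith", "mjones": "Mary Jones", "jlee": "Jordan Lee"}
labels = classify_users(["jsmith", "mjones", "jlee", "ghost"], directory, lookup)
```

Users who are missing from the directory or whose name is ambiguous stay unclassified, which is consistent with NBC labeling only part of the user population.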
2-II. Name-Based Classification (NBC)
                Nov 2007   Apr 2008
Total Users     27068      29982
Males (NBC)     5245       5807
Females (NBC)   5955       6817

• Compared to NBC, LBC requires less information (username not needed)
• NBC provides a way to validate LBC.
• The use of NBC is limited, as usernames are available in only a very
few of the currently available traces.
• Once we check the correctness of LBC, it can become the primary method
for classification.
16
II-A. Cross Validation of LBC
• We compared LBC with NBC using traces from U2
• The advantage is that NBC has a low error rate in
classification.
• Error in classification is calculated by

$$E_f = \frac{|FL \cap MN|}{|FL|} \qquad E_m = \frac{|ML \cap FN|}{|ML|}$$

Female Classification
Month      FL     FL ∩ MN   E_f
Nov 2007   1280   74        0.058
Apr 2008   1690   123       0.072
FL is classification as Female by LBC
MN is classification as Male by NBC

Male Classification
Month      ML    ML ∩ FN   E_m
Nov 2007   334   25        0.074
Apr 2008   349   29        0.083
ML is classification as Male by LBC
FN is classification as Female by NBC
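As a quick check of the error formula, the sketch below recomputes the four error rates from the counts reported on this slide. The function and variable names are made up for the example; only the counts come from the slide.

```python
def error_rate(n_conflicting, n_lbc):
    """Share of LBC-classified users that NBC assigns to the other gender."""
    if n_lbc <= 0:
        raise ValueError("LBC set must not be empty")
    return n_conflicting / n_lbc


def error_rate_from_sets(lbc_users, nbc_opposite_users):
    """Same measure computed from user-id sets: |LBC ∩ NBC_opposite| / |LBC|."""
    lbc = set(lbc_users)
    return error_rate(len(lbc & set(nbc_opposite_users)), len(lbc))


# (label, conflicting count, LBC count, value reported on the slide)
reported = [
    ("E_f Nov 2007", 74, 1280, 0.058),
    ("E_f Apr 2008", 123, 1690, 0.072),
    ("E_m Nov 2007", 25, 334, 0.074),
    ("E_m Apr 2008", 29, 349, 0.083),
]
computed = {label: error_rate(c, n) for label, c, n, _ in reported}
```

Each recomputed value agrees with the slide to within 0.001; the small differences appear to come from how the slide rounds to three decimals.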
17
3. User Behavior Analysis
• Traces allow us to follow a user over time. We can
therefore track classified users across the campus
and study their network usage behavior!
• We consider
– User Spatial distribution
– Temporal Analysis
– Device Preferences
18
3-I. Spatial Distribution
[Figure panels: U1 and U2]
• Both universities show more females in Social Sciences and Sports buildings
• Both universities show more males in Economics and Engineering buildings
• Inconsistent trends observed at Music buildings
19
3-II. Temporal Analysis
(session durations)
[Figure panels: U2 and U1]
• On average, males have longer session durations
• Over time, session durations are getting shorter (are users becoming more mobile?)
20
3-III. Device Preference
• Using the MAC address, one can find out the
manufacturer of a device.
• Our analysis at U1 shows (with statistical
significance) that females prefer Apple computers
over PCs.
• However, no such preference is shown at U2 for
the general population.
• We also see that external adapter vendors like
Enterasys, Linksys, and D-Link show a decreasing
trend in number of users. Most users are
getting devices with built-in Wi-Fi.
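The manufacturer lookup relies on the first three octets of a MAC address, the OUI, which identify the vendor. The sketch below extracts that prefix and tallies vendors per gender. The vendor table is a placeholder with made-up prefixes and names, not real registry data, and the input layout is an assumption for the example.

```python
from collections import Counter

# Placeholder prefixes and vendor names, NOT real registry assignments.
HYPOTHETICAL_VENDORS = {"AABBCC": "VendorA", "112233": "VendorB"}


def oui_prefix(mac):
    """Return the first three octets of a MAC address as 6 hex digits."""
    digits = "".join(ch for ch in mac if ch.isalnum()).upper()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    int(digits, 16)  # raises ValueError if any character is not hex
    return digits[:6]


def vendor_counts(macs_by_gender, vendors):
    """Count devices per vendor for each gender label."""
    return {
        gender: Counter(vendors.get(oui_prefix(mac), "unknown") for mac in macs)
        for gender, macs in macs_by_gender.items()
    }


# Made-up classified devices.
counts = vendor_counts(
    {
        "female": ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "11:22:33:00:00:03"],
        "male": ["11-22-33-00-00-04", "de:ad:be:ef:00:05"],
    },
    HYPOTHETICAL_VENDORS,
)
```

Comparing the per-gender counts, together with a significance test, would be the next step; the slides do not say which test was used, so none is shown here.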
21
3-III. Device Preference
Females prefer
Apple computer
over Intel based!
U1
22
4. Applications
• Mobility Models
– Incorporate effects of building context, ‘behavioral’
aspects, load (sessions duration) and density among
others on correlated collective/group behavior
• Protocol Design
– Effects of group behavior can be incorporated in
protocol design for mobile networks.
• Privacy
– The gender of the users could be inferred/extracted
from anonymized traces!
23
5. Conclusion & Future Work
• We introduce new methods to classify users into social
groups based on features like gender, study-major
among others.
• We used our methods on traces collected from two
different university campuses.
• The methods are able to distinguish between major
differences in group behaviors (mobility, vendor pref.)
• Issues of privacy and anonymity arise when dealing
with wireless network traces [UH09]
• This study opens doors for other mobile social
networking studies and profile-based service designs
based on sensing the human society.
24
Thank you!
Ahmed Helmy helmy@ufl.edu
URL: www.cise.ufl.edu/~helmy
25
Appendix
26
Sororities and Fraternities
Number       U1   U2
Sorority     7    13
Fraternity   12   5
27
PAM
• PAM attempts to minimize dissimilarity within a
cluster.
• It provides a technique called silhouette widths, and a
silhouette plot, to measure the quality of the clusters.
• The average width can be used to estimate the
quality of the clustering: above 0.70 indicates strong
clustering, between 0.50 and 0.70 a reasonable
structure, and below 0.50 a weak structure
• All clusters we found were above 0.65 in cluster
quality.
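For reference, the silhouette width of a point is commonly defined as s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes it with Euclidean distance and applies the thresholds listed above; the distance choice, the handling of single-member clusters (width 0), and the sample points are assumptions for the example.

```python
import math


def silhouette_widths(points, labels):
    """Per-point silhouette widths; needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention for single-member clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def interpret(average_width):
    """Apply the quality thresholds quoted on the slide."""
    if average_width > 0.70:
        return "strong"
    if average_width >= 0.50:
        return "reasonable"
    return "weak"


# Two well-separated, made-up clusters.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
widths = silhouette_widths(pts, [0, 0, 1, 1])
average = sum(widths) / len(widths)
quality = interpret(average)
```

For these toy points the average width is close to 0.9, so the clustering is rated strong.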
28
3-III. Device Preference
No gender bias is
noticed at U2
U2
29