Mining User Similarity Based on Location History
Yu Zheng, Quannan Li, Xing Xie
Microsoft Research Asia

Outline
• Introduction
• Architecture
  – Modeling Location History
  – Measuring User Similarity
• Experimental Results
• Conclusion

Introduction (1)
• Goals
  – Infer the similarity (correlation) between users from their location histories
  – Enable friend recommendation and personalized location recommendation
• Motivation
  – The increasing availability of user-generated trajectories
    • Life logging, travel experience sharing
    • Sports activity analysis, multimedia content management, …
  – People's outdoor movements in the real world imply their interests
    • Likes sports: frequently visits gyms and stadiums
    • Likes travel: usually visits mountains and lakes
  – According to the first law of geography
    • Everything is related to everything else, but near things are more related than distant things.
    • People with similar location histories might share similar interests and preferences.
  – Significance of user similarity in Web communities
    • Generally, it helps users find more relevant information in a large-scale dataset
    • In the GIS community: friend discovery and location recommendation

Introduction (2)
• Difficulty & Challenges
  – How to model different users' location histories uniformly
    • Different users' location histories are inconsistent and incomparable
    • What is a shared location? Defined by distance? ✗
  – How to measure the similarity between users
    • By counting the number of shared locations?
    • By the Pearson correlation or the cosine similarity?
    • These measures do not take into account two important properties of people's outdoor movements.
• Contribution and insights
  – A step towards integrating social networking into GIS
  – A hierarchical graph
    • Uniformly models different users' location histories at various scales of geo-space
  – A similarity measure that considers
    • The sequence property of users' movement behavior
    • The hierarchy property of geographic spaces

Preliminary
• GPS logs P and GPS trajectories
• Stay points S = {s1, s2, …, sn}
  – A stay point stands for a geo-region where a user has stayed for a while
  – E.g., a user spent more than 20 minutes within a distance of 200 meters
  – Carries a semantic meaning beyond a raw GPS point
• Location history: $LocH = (s_1 \xrightarrow{\Delta t_1} s_2 \xrightarrow{\Delta t_2} \cdots \xrightarrow{\Delta t_{n-1}} s_n)$
  – Represented by a sequence of stay points
  – With transition intervals
[Figure: a GPS log as a list of points p1: (Lat1, Lngt1, T1), p2: (Lat2, Lngt2, T2), …, pn: (Latn, Lngtn, Tn), and a trajectory p1 … p7 containing a stay point S.]

Architecture (1)
[Figure: system architecture. GPS logs of users 1 … n → Modeling Location History (a hierarchical graph for each individual) → Measuring Similarity (a similarity score Sij for each pair of users).]

Modeling Location History (1)
1. Stay point detection
2. Hierarchical clustering
3. Individual graph building
[Figure: the system architecture diagram, with these three steps shown for the Modeling Location History stage.]

Modeling Location History (2)
[Figure: the three modeling steps. (1) Stay point detection on the GPS logs of all users. (2) Hierarchical clustering of the stay points into a shared hierarchical framework: Layer 1 (c10), Layer 2 (c20, c21), Layer 3 (c30 to c34), ordered from high to low. (3) Individual graph building: each user's stay points are placed on the shared framework, giving graphs G1, G2, and G3 on layers 1 to 3. Legend: a dot stands for a stay point S; a region stands for a stay point cluster cij.]

Measuring User Similarity (1)
1. Sequence extraction
2. Sequence matching
3. Similarity score calculation
[Figure: the system architecture diagram, with these three steps shown for the Measuring Similarity stage.]

Measuring Similarity (2)
• Similar sequence extraction
  – Example sequences on layer 3:
    $seq_3^1 = c_{32}(1) \to c_{31}(1) \to c_{33}(2) \to c_{32}(2) \to c_{33}(1) \to c_{32}(1)$
    $seq_3^2 = c_{31}(1) \to c_{33}(1) \to c_{32}(1) \to c_{31}(2) \to c_{32}(1) \to c_{31}(1)$
[Figure: two users' hierarchical graphs (layers G1 to G3) with their time-ordered stay points s1 … s8 placed in the layer-3 clusters c30 … c34.]

Measuring Similarity (3)
• Sequence matching
  – We aim to find the maximum-length similar sequence
  – A pair of similar sequences: two individuals visit the same sequence of places with similar time intervals
    • Same visiting order: $a_i = b_i$
    • Similar transition time: $\frac{|\Delta t_i - \Delta t_i'|}{\max(\Delta t_i, \Delta t_i')} \le \rho$
[Figure: sequence matching example for two users u1 and u2 over places A, B, and C, with transition times in hours (5h, 8h, 6h, 7h, 6.5h, 14h). Sequences shown: A(1) → C(1) → B(1) → A(2) and B(1) → A(1) → C(2) → B(2). Candidate matches are marked ✗ or ✓; A → B → C is marked ✓.]
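The two matching conditions (same visiting order, similar transition time) can be illustrated with a minimal sketch. It is a simplification and not the authors' algorithm: it only matches contiguous runs of visits, and the names `Visit`, `similar_gap`, `longest_similar_run`, the threshold value `rho=0.2`, and the sample data are all assumptions made for illustration.

```python
from typing import List, Tuple

# (place id, hours until the next visit). The gap of the last visit is unused.
Visit = Tuple[str, float]


def similar_gap(dt_a: float, dt_b: float, rho: float) -> bool:
    """Check |dt_a - dt_b| / max(dt_a, dt_b) <= rho (two zero gaps count as similar)."""
    longest = max(dt_a, dt_b)
    return longest == 0 or abs(dt_a - dt_b) / longest <= rho


def longest_similar_run(a: List[Visit], b: List[Visit], rho: float = 0.2) -> List[str]:
    """Longest contiguous run of places visited in the same order by both
    users with similar transition times (assumed simplification)."""
    best: List[str] = []
    # run[i][j] = length of the matched run ending at a[i-1] and b[j-1]
    run = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1][0] != b[j - 1][0]:
                continue  # different places: no match ends here
            run[i][j] = 1
            # Extend the previous run only if the transition times agree.
            if run[i - 1][j - 1] > 0 and similar_gap(a[i - 2][1], b[j - 2][1], rho):
                run[i][j] = run[i - 1][j - 1] + 1
            if run[i][j] > len(best):
                best = [place for place, _ in a[i - run[i][j]:i]]
    return best


# Hypothetical data, not taken from the slides.
u1 = [("A", 7.0), ("B", 6.5), ("C", 0.0)]
u2 = [("B", 5.0), ("A", 8.0), ("B", 6.0), ("C", 0.0)]
print(longest_similar_run(u1, u2))  # ['A', 'B', 'C']
```

With a stricter threshold such as `rho=0.05`, the A → B transition (7h against 8h) no longer qualifies, so no run longer than a single place is matched.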

Measuring Similarity (4)
• Similarity calculation
  – Two factors
    • The length of the matched similar sequence
    • The level of the matched similar sequence
  – Calculation
    1. Calculate a similarity score for each matched sequence, weighted by its length m:
       $s(m) = \alpha(m) \sum_{i=1}^{m} \min(k_i, k_i')$
    2. Add up the scores of the n sequences found on a level l:
       $S_l = \frac{1}{N_1 \times N_2} \sum_{i=1}^{n} s_i$
    3. Take the weighted sum of the scores over the H levels:
       $S_{overall} = \sum_{l=1}^{H} \beta_l S_l$

Measuring Similarity (5)
• Example: for User 1, User 3 is more similar than User 2
  – User 1 vs. User 2
    • Layer 2: User 1: A → B; User 2: A → B; matched: A → B
    • Layer 3: User 1: a → c → e; User 2: b → d; no shared sequence
  – User 1 vs. User 3
    • Layer 2: User 1: A → B; User 3: A → B; matched: A → B
    • Layer 3: User 1: a → c → e; User 3: b → c → e; matched: c → e

Experiments (1)
• GPS devices and users
  – 112 users collected the data over the past year
[Figure: two charts of the user population, by age (age ≤ 22, 22 < age ≤ 25, 26 ≤ age < 29, age ≥ 30) and by occupation (Microsoft employees, employees of other companies, government staff, college students). Values shown: 9%, 16%, 18%, 30%, 14%, 45%, 58%, 10%.]

Experiments (2)
• GPS dataset
  – More than 6 million GPS points
  – More than 170,000 kilometers
  – 36 cities in China and a few cities in the USA, Korea, and Japan

Experiments (3)
• Evaluation approach
  – Evaluated as an information retrieval problem
  – Ground truth: users label their relationships with the ratings shown in this table

    Level  Relevance         Relationship suggestion
    4      Strongly similar  Family members / intimate lovers / roommates
    3      Similar           Good friends / workmates / classmates
    2      Weakly similar    Ordinary friends, neighbors in a community
    1      Different         Strangers in the same city
    0      Quite different   Strangers in other cities

[Figure: evaluation pipeline. For a query user, retrieve the top ten similar users, look up their ratings in the relationship matrix (U1, U2, …, Un) to get the ground truth, for example G = (4, 3, 2, 3, 0, 1, …, 0, 0), and compare it with the ideal ranking (4, 3, 3, 2, 2, 1, …, 0, 0) to calculate nDCG and MAP.]
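The nDCG part of this evaluation can be sketched as follows. The slides do not say which DCG variant was used, so this sketch assumes the common form DCG@k = Σ rel_i / log2(i + 1); the function names and the short rating lists are hypothetical and only illustrate the 0 to 4 relevance scale.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k ratings (assumed log2 discount)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(retrieved: Sequence[float], all_ratings: Sequence[float], k: int) -> float:
    """DCG of the retrieved ranking divided by the DCG of the ideal ranking.

    retrieved:   ground-truth ratings of the retrieved users, in retrieval order
    all_ratings: ground-truth ratings of every candidate user for this query
    """
    ideal = sorted(all_ratings, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(retrieved, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ratings on the 0-4 relevance scale, not results from the study.
candidate_ratings = [4, 3, 2, 3, 0, 1]
retrieved_ratings = [4, 3, 2, 3, 0, 1]  # order in which a method returned the users
print(round(ndcg_at_k(retrieved_ratings, candidate_ratings, k=5), 3))
```

A ranking that returns users in descending order of their ground-truth rating scores exactly 1.0; any other order scores lower, which is what makes the measure suitable for comparing similarity methods.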

Experiments (4)
• Comparison with baselines
  – The Pearson correlation
  – Cosine similarity
[Figure: chart of MAP (axis from 0.72 to 0.96) for each method.]

Experiments (5)
• nDCG comparison
[Figure: chart of nDCG@5 and nDCG@10 (axis from 0.78 to 0.94) for each method.]

Conclusion
• A hierarchical graph
  – A uniform framework to measure various users' location histories
  – Effectively models users' outdoor movements
    • Sequentially
    • Hierarchically
• Our similarity measure outperformed existing methods
  – The Pearson correlation
  – Cosine similarity
  – Hierarchy + Sequence achieved the best performance

Thanks!
Microsoft Research Asia
yuzheng@microsoft.com