Social Network Extraction of Academic Researchers
Jie Tang, Duo Zhang, and Limin Yao
Tsinghua University
Oct. 29th 2007

Outline
• Motivation
• Related Work
• Problem Description
• Our Approach
• Experimental Results
• Summary

Motivation
• More and more online social networks are becoming available
  – e.g., YouTube.com, Facebook.com, etc.
• However, these social networks are usually separated from one another
• A question arises: can we automatically build an integrated social network from the separated ones?
• As a case study: how can we automatically build a social network for the academic community?
  – ArnetMiner.org

Motivating Example
[Figure: the homepage of Ruud Bolle (IBM T.J. Watson Research Center) and his DBLP publication list are mapped to a structured researcher profile. Annotated homepage regions: contact information, educational history, academic services, publications. Profile fields: Name, Photo, Position, Affiliation, Address, Email, Homepage, Research_Interest, Bsdate/Bsuniv/Bsmajor, Msdate/Msuniv/Msmajor, Phddate/Phduniv/Phdmajor. Each publication (Title, Venue, Date, Start_page, End_page) links the researcher to co-authors, who have their own position and affiliation.]

Two key issues:
• How to accurately extract the researcher profile information from the Web?
• How to integrate the information from different sources?

Related Work – Person Profiling
• Profile Information Extraction
  – E.g., Yu et al. (2005), resume IE
  – Alani et al. (2003), Artequakt system
• Contact Information Extraction
  – E.g., Kristjansson et al. (2004), interactive extraction
  – Balog and de Rijke (2006), heuristic rules
• Information Extraction Methods
  – E.g., HMM (Ghahramani, 1997)
  – MEMM (McCallum, 2000)
  – CRFs (Lafferty, 2001)

Related Work – Name Disambiguation
• Unsupervised Methods
  – Hierarchical clustering, K-way spectral clustering, etc.
  – E.g., Han (2005), Mann (2003), Tan (2006)
• Supervised Methods
  – Support Vector Machines, Naïve Bayes, etc.
  – E.g., Han (2004)
• Graph-based Approach
  – Random Walk, etc.
  – E.g.,
Bekkerman (2005), Malin (2005), Minkov (2006)

Researcher Social Network Extraction
• 70.60% of the researchers have at least one homepage or an introducing page
• 85.6% are from universities; 14.4% are from companies
• 71.9% are homepages; 28.1% are introducing pages
• 40% are in lists and tables; 60% are natural language text
• A large number of person names are ambiguous
  – Even three people named “Yi Li” graduated from the author’s lab
• 70% moved at least once
[Figure: profile schema. A Researcher has Name, Photo, Position, Affiliation, Address, Email, Phone, Fax, Homepage, Research_Interest, Bsdate, Bsuniv, Bsmajor, Msdate, Msuniv, Msmajor, Phddate, Phduniv, Phdmajor. The Authored relation links a Researcher to a Publication (Title, Publication_venue, Date, Start_page, End_page); the Coauthor relation links a Researcher to a Person.]

Markov Random Field
[Figure: an undirected graph over the random variables Ya, Yb, Yc, Yd, Ye, Yf.]
Special cases:
- Conditional Random Fields
- Hidden Markov Random Fields
Markov property:
$$P(Y_i \mid Y_j,\ Y_j \neq Y_i) = P(Y_i \mid Y_j,\ Y_j \sim Y_i)$$
where $Y_j \sim Y_i$ means that $Y_j$ is a neighbor of $Y_i$ in the graph.

CRFs
- Green nodes are hidden variables
- Purple nodes are observations
[Figure: a linear-chain CRF over the tokens “He is a Professor at UIUC”; each token has the candidate labels ADR, AFF, POS, and OTH.]
$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in E,\, j} \lambda_j\, t_j(e, y|_e, x) + \sum_{v \in V,\, k} \mu_k\, s_k(v, y|_v, x) \Big)$$
Here $t_j$ are the transition features defined on edges $e$, $s_k$ are the state features defined on vertices $v$, and $Z(x)$ is the normalization factor.

Processing Flow for Profiling
[Figure: a three-stage flow with separate training and test paths. (1) Preprocessing: determine the tokens in the input documents (standard word, special word, image token, term, punctuation mark) and assign possible tags to each token. (2) Tagging: apply a unified tagging model to produce the tagging results. (3) Model learning: define features and learn a CRF model from labeled data.]
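The CRF formula can be made concrete with a small brute-force sketch on the example sentence "He is a Professor at UIUC". This is an illustration under stated assumptions, not the tagging model used in the talk: the feature functions, their weights, and the names `state_score`, `transition_score`, and `crf_distribution` are all hypothetical, and the partition function Z(x) is computed by enumerating every label sequence rather than by the forward algorithm.

```python
import itertools
import math

LABELS = ["POS", "AFF", "ADR", "OTH"]
FUNCTION_WORDS = {"he", "is", "a", "at"}  # hypothetical stop-word list


def state_score(token, label):
    """Sum of weighted state features mu_k * s_k(v, y|_v, x) for one token."""
    score = 0.0
    if token == "Professor" and label == "POS":
        score += 2.0  # hypothetical "positive word" feature for Position
    if token.isupper() and len(token) > 1 and label == "AFF":
        score += 2.0  # hypothetical "all-caps acronym" feature for Affiliation
    if token.lower() in FUNCTION_WORDS and label == "OTH":
        score += 1.0  # hypothetical "function word" feature for Other
    return score


def transition_score(prev_label, label):
    """Sum of weighted transition features lambda_j * t_j(e, y|_e, x) for one edge."""
    return 0.5 if (prev_label, label) == ("OTH", "OTH") else 0.0


def sequence_score(tokens, labels):
    """Unnormalized log-score: the exponent in p(y|x)."""
    total = sum(state_score(t, y) for t, y in zip(tokens, labels))
    total += sum(transition_score(a, b) for a, b in zip(labels, labels[1:]))
    return total


def crf_distribution(tokens):
    """p(y|x) for every label sequence y, by exhaustive enumeration."""
    scores = {
        ys: sequence_score(tokens, ys)
        for ys in itertools.product(LABELS, repeat=len(tokens))
    }
    z = sum(math.exp(s) for s in scores.values())  # partition function Z(x)
    return {ys: math.exp(s) / z for ys, s in scores.items()}


tokens = ["He", "is", "a", "Professor", "at", "UIUC"]
dist = crf_distribution(tokens)
best = max(dist, key=dist.get)
print(list(zip(tokens, best)), round(dist[best], 4))
```

Enumeration costs 4^6 sequences here, which is fine for six tokens; a practical tagger would use dynamic programming (forward algorithm for Z(x), Viterbi for the best sequence) instead.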
Token Definitions
• Standard word: words in natural language
• Special word: several general “special words”, e.g. email address, IP address, URL, date, number, money, percentage, and unnecessary tokens (e.g. ‘===’ and ‘###’)
• Image token: e.g. <IMAGE src="defaul3.jpg" alt=""/>
• Term: base NP, like “Computer Science”
• Punctuation marks: period, question mark, and exclamation mark

Possible Tag Assignment
Token type          Possible tags
Standard word       All possible tags
Special word        Position, Affiliation, Address, Email, Phone, Fax, Phd/Ms/Bs-date
Image token         Photo, Email
Term token          Position, Affiliation, Address, Phd/Ms/Bs-univ, Phd/Ms/Bs-major
Punctuation marks   Position, Affiliation, Address, Email, Phone, Fax, Phd/Ms/Bs-date

Feature Definition
• Content features
  Standard word
  – Word features: whether the current token is a word
  – Morphological features: whether the word is capitalized
  Image token
  – Image size: the size of the image
  – Image height/width ratio: the value of height/width; for a person photo it is often larger than 1
  – Image format: JPG or BMP
  – Image color: the number of unique colors used in the image and the number of bits used per pixel, i.e. 32, 24, 16, 8, 1
  – Face recognition: whether the current image contains a person's face
  – Image filename: whether the filename contains (partially) the researcher name
  – Image “ALT”: whether the “alt” of the image contains (partially) the researcher name
  – Image positive keywords: “myself”, “biology”
  – Image negative keywords: “ads”, “banner”, “logo”
• Pattern features
  – Positive words: e.g. “Fax:” for Fax, “director” for Position
  – Special tokens: whether the current token is a special word
• Term features
  – Term: whether the current token is a term
  – Dictionary: whether the current token is included in a dictionary

Our Method for Name Disambiguation
• A hidden Markov Random Field model
• Hidden variables Y represent the labels of publications
• Observable variables X represent publications
• Constraints define the dependencies over hidden variables
[Figure: eleven publications x1 to x11 with hidden labels y1 to y11 taking the values 1, 2, or 3; edges between the labels are marked coauthor, cite, co-conference, and τ-coauthor.]

Objective Function
Maximize the posterior
$$P(Y \mid X) \propto P(Y)\, P(X \mid Y)$$
with
$$P(Y) = \frac{1}{Z_1}\exp\big(-V(Y)\big) = \frac{1}{Z_1}\exp\Big(-\sum_{N_i \in N} V_{N_i}(Y)\Big) = \frac{1}{Z_1}\exp\Big(-\sum_i \sum_j V(i,j)\Big)$$
$$= \frac{1}{Z_1}\exp\Big(-\sum_i \sum_j \Big\{ D(x_i, x_j)\, I(y_i \neq y_j) \sum_{c_k \in C} [w_k c_k(y_i, y_j)] \Big\}\Big)$$
$$P(X \mid Y) = \frac{1}{Z_2}\exp\Big(-\sum_{x_i \in X} D(x_i, y_i)\Big)$$
Taking the negative logarithm, this is equivalent to minimizing
$$f_{obj} = \sum_i \sum_j \Big\{ D(x_i, x_j)\, I(y_i \neq y_j) \sum_{c_k \in C} [w_k c_k(y_i, y_j)] \Big\} + \sum_{x_i \in X} D(x_i, y_i) + \log Z, \qquad Z = Z_1 Z_2$$

Constraint Definition
C    W    Constraint name   Description
c1   w1   CoOrg             $a_i^{(0)}$.affiliation = $a_j^{(0)}$.affiliation
c2   w2   CoAuthor          $\exists\, r, s > 0:\ a_i^{(r)} = a_j^{(s)}$
c3   w3   Citation          $p_i$ cites $p_j$ or $p_j$ cites $p_i$
c4   w4   CoEmail           $a_i^{(0)}$.email = $a_j^{(0)}$.email
c5   w5   Feedback          Constraints from user feedback
c6   w6   τ-CoAuthor        One common author in τ extension

Example publications and their authors: p1: A, B, C; p2: A, B; p3: A, D; p4: C, D
[Figure: the matrices $M_p^{(0)}$, $M_p^{(1)}$, $M_p^{(2)}$, $M_p^{(3)}$ over the publications of this example.]

Parameterized Distance Function
We define our distance function as follows:
$$D(x_i, x_j) = 1 - \frac{x_i^T A x_j}{\|x_i\|_A\, \|x_j\|_A}, \qquad \text{where } \|x_i\|_A = \sqrt{x_i^T A x_i}$$
• $\|x_i\|_A$ is the Euclidean norm of $x_i$ after mapping it into a new space, i.e. $A^{1/2} x_i$
• To simplify the problem, we define A as a diagonal matrix with diagonal entries $a_m$

EM Framework
Notation: $l_i$ is the cluster label of $x_i$, $p_i$ is the publication represented by $x_i$, and $y_h$ is the centroid of cluster $h$.
• Initialization
  – Use constraints to generate the initial k clusters
• E-Step
  – Assign each $x_i$ to the cluster $h$ that minimizes
$$f_{obj}(x_i, y_h) = \sum_j \Big\{ D(x_i, x_j)\, I(h \neq l_j) \sum_{c_k \in C} [w_k c_k(p_i, p_j)] \Big\} + D(x_i, y_h)$$
• M-Step
  – Update cluster centroid
$$y_h = \frac{\sum_{i:\, l_i = h} x_i}{\big\| \sum_{i:\, l_i = h} x_i \big\|_A}$$
  – Update parameter matrix A, using the partial derivative with respect to each diagonal entry $a_m$
$$\frac{\partial f_{obj}}{\partial a_m} = \sum_i \sum_j \Big\{ \frac{\partial D(x_i, x_j)}{\partial a_m}\, I(l_i \neq l_j) \sum_{c_k \in C} [w_k c_k(p_i, p_j)] \Big\} + \sum_{x_i \in X} \frac{\partial D(x_i, y_{l_i})}{\partial a_m}$$
$$\frac{\partial D(x_i, x_j)}{\partial a_m} = -\,\frac{x_{im} x_{jm}\, \|x_i\|_A \|x_j\|_A - x_i^T A x_j\, \dfrac{x_{im}^2 \|x_j\|_A^2 + x_{jm}^2 \|x_i\|_A^2}{2\, \|x_i\|_A \|x_j\|_A}}{\|x_i\|_A^2\, \|x_j\|_A^2}$$

Profiling Experiments
• Dataset
  – 1K researchers from ArnetMiner.org
• Baselines
  – Amilcare
  – Support Vector Machines
  – Unified_NT (CRFs without transition features)
• Evaluation measures
  – Precision, Recall, F1

Profiling Results: 5-fold cross validation (scores in %)
Profiling Task   Unified   Unified_NT   SVM     Amilcare
Photo            89.11     88.64        88.86   31.62
Position         69.44     64.70        64.68   56.48
Affiliation      83.52     72.16        73.86   46.65
Phone            91.10     78.72        79.71   83.33
Fax              90.83     64.28        64.17   86.88
Email            80.35     75.47        79.37   78.70
Address          86.34     75.15        77.04   66.24
Bsuniv           67.38     57.56        59.54   47.17
Bsmajor          64.20     59.18        60.75   58.67
Bsdate           53.49     40.59        28.49   52.34
Msuniv           57.55     47.49        49.78   45.00
Msmajor          63.35     61.92        62.10   57.14
Msdate           48.96     41.27        30.07   56.00
Phduniv          63.73     53.11        57.01   59.42
Phdmajor         67.92     59.30        59.67   57.93
Phddate          57.75     42.49        41.44   61.19
Overall          83.37     72.09        73.57   62.30

Contribution of Features
[Figure: bar chart of F1 measure (y-axis, 68 to 84) by feature set (x-axis): content, content+term, content+pattern, all.]

Disambiguation Experiments
• Data sets

Abbreviated Name dataset
Name Set   Publications   Name Variations
C. Chang   402            97
G. Wu      152            46
K. Zhang   293            40
J. Li      551            102
B. Liang   55             14
M. Hong    108            30
X. Xie     136            36
P. Xu      39             5
H. Xu      182            60
W. Yang    263            82

Real Name dataset
Names: Jing Zhang (54/25), Yi Li (42/22)
Affiliation                              Publications
Shanghai Jiao Tong Univ.                 6
Dept. of Automation, Tsinghua Univ.      3
Alabama Univ.                            8
Univ. of California, Davis               4
Carnegie Mellon University               5
State Univ. of New York at Albany        4
National Univ. of Singapore              6
South China Univ. of Technology          2
George Mason Univ.                       2
Chinese Academy of Sciences              5
Univ. of Washington                      3
Nanjing Normal Univ.                     4

Experiment Setup
• Baseline method
  – Unsupervised hierarchical clustering
• Measurement
$$\text{Pairwise Precision} = \frac{\#\text{PairsCorrectlyPredictedToSameAuthor}}{\#\text{TotalPairsPredictedToSameAuthor}}$$
$$\text{Pairwise Recall} = \frac{\#\text{PairsCorrectlyPredictedToSameAuthor}}{\#\text{TotalPairsToSameAuthor}}$$
$$\text{Pairwise F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Disambiguation Results
            Unsupervised Hierarchical Clustering   Constraint-based Probabilistic Framework
Name        Precision   Recall   F1                Precision   Recall   F1
C. Chang    0.65        0.59     0.62              0.73        0.67     0.70
G. Wu       0.71        0.62     0.66              0.75        0.75     0.75
K. Zhang    0.75        0.60     0.67              0.79        0.71     0.75
J. Li       0.62        0.52     0.57              0.66        0.59     0.62
B. Liang    0.82        0.76     0.79              0.85        0.89     0.87
M. Hong     0.79        0.65     0.71              0.82        0.75     0.78
X. Xie      0.77        0.73     0.75              0.83        0.82     0.82
P. Xu       0.89        0.95     0.92              0.94        1.00     0.97
H. Xu       0.65        0.59     0.62              0.73        0.67     0.70
W. Yang     0.71        0.62     0.66              0.75        0.75     0.75
Average     0.75        0.60     0.67              0.79        0.71     0.75

Contribution of Different Constraints
[Figure: horizontal bar chart of pairwise F1-measure (x-axis, 0.0 to 1.0) for Yi Li and Jing Zhang under each constraint combination: Baseline, No-Reference, No-k-Author, No-CoOrg, No-CoEmail, No-Coauthor, Reference, k-Author, CoOrg, CoEmail, CoAuthor, All.]

How Profiling and Disambiguation Help Expert Finding
• Expert finding by using a PageRank-based method
[Figure: bar chart (y-axis, 0 to 1) comparing EF, EF+RPE, and EF+RPE+ND on P@5, P@10, P@20, P@30, R-prec, MAP, bpref, and MRR.]

Summary
• Investigated the problem of researcher social network extraction
• Proposed a unified approach to perform profiling
  and a constraint-based probabilistic model for name disambiguation
• Experimental results show that our approaches outperform the baseline methods
• When applied to expert finding, the approaches yield a significant improvement in performance

Thanks! Q&A
HP: http://keg.cs.tsinghua.edu.cn/persons/tj/
Online Demo: http://arnetminer.org
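As a supplement to the parameterized distance function D(x_i, x_j) = 1 - x_i^T A x_j / (||x_i||_A ||x_j||_A), the following is a minimal sketch assuming a diagonal A stored as a list `a` of positive weights. The function names (`norm_a`, `inner_a`, `distance`, `distance_grad`, `numeric_grad`) and the sample vectors are illustrative assumptions, not part of the original system. The analytic partial derivative with respect to a_m is obtained by differentiating D directly, and it is compared against a central finite difference.

```python
import math


def norm_a(x, a):
    """||x||_A = sqrt(x^T A x) for a diagonal A with entries a."""
    return math.sqrt(sum(am * xm * xm for am, xm in zip(a, x)))


def inner_a(xi, xj, a):
    """x_i^T A x_j for a diagonal A with entries a."""
    return sum(am * p * q for am, p, q in zip(a, xi, xj))


def distance(xi, xj, a):
    """Parameterized cosine distance D(x_i, x_j)."""
    return 1.0 - inner_a(xi, xj, a) / (norm_a(xi, a) * norm_a(xj, a))


def distance_grad(xi, xj, a, m):
    """Analytic partial derivative of D(x_i, x_j) with respect to a_m."""
    ni, nj = norm_a(xi, a), norm_a(xj, a)
    cross = xi[m] ** 2 * nj ** 2 + xj[m] ** 2 * ni ** 2
    numerator = xi[m] * xj[m] * ni * nj - inner_a(xi, xj, a) * cross / (2 * ni * nj)
    return -numerator / (ni ** 2 * nj ** 2)


def numeric_grad(xi, xj, a, m, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    hi, lo = list(a), list(a)
    hi[m] += eps
    lo[m] -= eps
    return (distance(xi, xj, hi) - distance(xi, xj, lo)) / (2 * eps)


# Hypothetical feature vectors and diagonal weights.
xi, xj, a = [1.0, 0.0, 2.0], [1.0, 1.0, 0.5], [1.0, 2.0, 0.5]
for m in range(len(a)):
    print(m, distance_grad(xi, xj, a, m), numeric_grad(xi, xj, a, m))
```

With all weights equal to 1, `distance` reduces to one minus the ordinary cosine similarity, which is a quick sanity check on any implementation.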
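The E-step of the EM framework can be sketched as a single greedy pass that reassigns each publication to the cluster with the lowest local objective. This is a simplified illustration under several assumptions: the summed constraint weight for a pair (the sum over k of w_k c_k(p_i, p_j)) is collapsed into one number per pair in a hypothetical `constraints` dictionary, centroids are held fixed, A is diagonal, and the function names and toy data are invented for the example.

```python
import math


def distance(xi, xj, a):
    """Parameterized cosine distance with a diagonal A given by the list a."""
    inner = sum(am * p * q for am, p, q in zip(a, xi, xj))
    ni = math.sqrt(sum(am * p * p for am, p in zip(a, xi)))
    nj = math.sqrt(sum(am * q * q for am, q in zip(a, xj)))
    return 1.0 - inner / (ni * nj)


def e_step(points, labels, centroids, constraints, a):
    """One greedy pass: move each x_i to the cluster h minimizing f_obj(x_i, y_h).

    constraints maps an index pair (i, j) to its summed constraint weight.
    """
    labels = list(labels)
    for i, xi in enumerate(points):
        # Collect the constrained partners of publication i.
        partners = []
        for (p, q), weight in constraints.items():
            if p == i:
                partners.append((q, weight))
            elif q == i:
                partners.append((p, weight))

        def cost(h):
            # Penalty for every constrained partner left in a different cluster.
            penalty = sum(
                distance(xi, points[j], a) * weight
                for j, weight in partners
                if labels[j] != h
            )
            return penalty + distance(xi, centroids[h], a)

        labels[i] = min(range(len(centroids)), key=cost)
    return labels


# Toy data: x2 is slightly closer to centroid 0 but shares a constraint with x1.
points = [[1.0, 0.1], [0.1, 1.0], [0.6, 0.5]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, 1.0]
print(e_step(points, [0, 1, 0], centroids, {}, weights))
print(e_step(points, [0, 1, 0], centroids, {(1, 2): 1.0}, weights))
```

In the toy data the constraint between the second and third publications is enough to pull the third one into the second cluster, which is the intended effect of the constraint term in the objective.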
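The pairwise precision, recall, and F1 measures from the experiment setup can be computed directly from two label lists. The sketch below assumes one true author identifier and one predicted cluster identifier per publication; the function name `pairwise_scores` and the sample labels are hypothetical.

```python
from itertools import combinations


def pairwise_scores(true_authors, predicted_clusters):
    """Return (precision, recall, F1) over all unordered publication pairs."""
    predicted_same = true_same = correct = 0
    for i, j in combinations(range(len(true_authors)), 2):
        same_pred = predicted_clusters[i] == predicted_clusters[j]
        same_true = true_authors[i] == true_authors[j]
        predicted_same += same_pred
        true_same += same_true
        correct += same_pred and same_true
    # Guard against empty denominators (no pairs predicted or no true pairs).
    precision = correct / predicted_same if predicted_same else 0.0
    recall = correct / true_same if true_same else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Five publications by two real authors, clustered imperfectly.
true_authors = ["a", "a", "a", "b", "b"]
predicted_clusters = [0, 0, 1, 1, 1]
print(pairwise_scores(true_authors, predicted_clusters))
```

Counting pairs rather than single publications means that merging two large clusters by mistake is penalized much more heavily than misplacing one publication.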