On Node Classification in Dynamic Content-based Networks

Charu Aggarwal (IBM T. J. Watson Research Center, charu@us.ibm.com)
Nan Li (University of California, Santa Barbara, nanli@cs.ucsb.edu)

Presented by Nan Li
Motivation

[Figure: snapshot of a co-authorship network in year 2001. Authors (Ke Wang, Jian Pei, Jiawei Han, Kenneth A. Ross) are annotated with paper-title keywords such as "Mining", "Sequential Pattern", "Data Mining", "Association Rules", and "Algorithms".]
Motivation

[Figure: the same co-authorship network in year 2002. The author set has changed (Ke Wang, Jian Pei, Jiawei Han, Marianne Winslett, Xifeng Yan, Philip S. Yu), and so have the keywords, now including "Ranking", "Stream", "Web", and "Semantics".]
Motivation

[Figure: the network again in year 2003, now including Charu Aggarwal, with keywords such as "Indexing", "XML", "Graph", and "Wireless". Both the participants and their associated text keep changing from year to year.]
Motivation

Networks annotated with an increasing amount of text
• Citation networks, co-authorship networks, product databases with large amounts of text content, etc.
• Highly dynamic

Node classification problem
• Often arises in network scenarios in which the underlying nodes are associated with content.
• A subset of the nodes in the network may be labeled.
• Can we use these labeled nodes, in conjunction with the network structure, to classify the nodes that are not currently labeled?

Applications
Challenges

Information networks are very large
• Scalable and efficient

Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust

Such networks are often noisy
• Intelligent and selective

Heterogeneous correlations in such networks

[Figure: a small example network whose nodes carry class labels A, B, and C, illustrating heterogeneous label correlations.]
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Related Work

Link-based classification (Bhagat et al., WebKDD 2007)
• Local iterative
• Global nearest neighbor

Content-only classification (Nigam et al., Machine Learning 2000)
• Each object's own attributes only

Relational classification (Sen et al., Technical Report 2004)
• Each object's own attributes
• Attributes and known labels of the neighbors

Collective classification (Macskassy & Provost, JMLR 2007; Sen et al., Technical Report 2004; Chakrabarti, SIGMOD 1998)
• Local classification
  • Flexible: ranging from a decision tree to an SVM
• Approximate inference algorithms
  • Iterative classification
  • Gibbs sampling
  • Loopy belief propagation
  • Relaxation labeling
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
DYCOS in a Nutshell

Node classification in a dynamic environment
• Dynamic network: the entire network at time t is denoted by Gt = (Nt, At, Tt), where Nt is the node set, At the edge set, and Tt the set of currently labeled nodes.
• Problem statement: classify the unlabeled nodes (Nt \ Tt) using both the content and the structure of the network, for all time stamps, in an efficient and accurate manner (a data-structure sketch follows below).

[Figure: three network snapshots at times t, t+1, and t+2.]

Both the structure and the content of the network change over time!
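As one concrete reading of this notation, here is a minimal Python sketch of a single snapshot; the class and field names are illustrative (not from the paper), and taking At as the edge set and Tt as the labeled subset follows the problem statement above.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One snapshot Gt = (Nt, At, Tt) of the dynamic network."""
    nodes: set    # Nt: all node ids at time t
    edges: set    # At: undirected edges as (u, v) tuples
    labels: dict  # Tt: labeled nodes mapped to their class

    def unlabeled(self):
        # The classification targets at time t: Nt \ Tt.
        return self.nodes - self.labels.keys()
```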
Semi-bipartite Transformation

Text-augmented representation
• Leveraged for a random walk-based classification model that uses both links and text
• Two partitions: structural nodes and word nodes (see the sketch below)
• Semi-bipartite: one partition of nodes is allowed to have edges either within the set, or to nodes in the other partition.

Efficient updates upon dynamic changes
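A minimal sketch of this construction in Python, assuming node text is already tokenized into word lists; the function and variable names are illustrative, not from the paper.

```python
from collections import defaultdict

def build_semi_bipartite(edges, node_text, keywords):
    """Augment the structural graph with one word node per keyword.

    Structural nodes keep their original edges (edges within that partition
    are allowed); word nodes link only to the structural nodes containing
    that word, which makes the graph semi-bipartite, not strictly bipartite.
    """
    adj = defaultdict(set)
    for u, v in edges:                      # structural-structural edges
        adj[u].add(v)
        adj[v].add(u)
    for node, words in node_text.items():   # structural-word edges
        for w in set(words) & keywords:
            word_node = ("word", w)         # tag word nodes to avoid id clashes
            adj[node].add(word_node)
            adj[word_node].add(node)
    return adj
```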
Random Walk-Based Classification

Random walks over the augmented structure
• Starting node: the unlabeled node to be classified.
• Structural hop: a random jump from a structural node to one of its neighbors.
• Content-based multi-hop: a jump from a structural node to another through implicit common word nodes.
• Structural parameter ps: the probability of taking a structural hop rather than a content-based multi-hop.

Classification
• Classify the starting node with the most frequently encountered class label during the random walks (see the sketch below).
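A hedged sketch of the walk-and-vote procedure, reusing the adjacency built above; the default values of ps, the walk count, and the hop count are placeholders, and the helper is_word is an assumption tied to the earlier sketch, not part of the paper's API.

```python
import random
from collections import Counter

def is_word(n):
    # Word nodes are the ("word", w) tuples from build_semi_bipartite above.
    return isinstance(n, tuple) and len(n) == 2 and n[0] == "word"

def classify(start, adj, labels, ps=0.7, walks=100, hops=5):
    """Vote over `walks` random walks of length `hops` from `start`.

    Each step is a structural hop with probability ps (move to a structural
    neighbor) and otherwise a content-based multi-hop (through a shared word
    node to another structural node containing that word).
    """
    votes = Counter()
    for _ in range(walks):
        node = start
        for _ in range(hops):
            if random.random() < ps:        # structural hop
                nxt = [n for n in adj[node] if not is_word(n)]
            else:                           # content multi-hop via word nodes
                words = [n for n in adj[node] if is_word(n)]
                nxt = [n for w in words for n in adj[w] if n != node]
            if not nxt:
                break
            node = random.choice(nxt)
            if node in labels:              # tally labels seen along the walk
                votes[labels[node]] += 1
    return votes.most_common(1)[0][0] if votes else None
```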
Gini-Index & Inverted Lists

Discriminative keywords
• A set Mt of the top m words with the highest discriminative power is used to construct the word-node partition.
• Gini-index: for a word w and classes 1..k,

  p_i(w) = n_i(w) / Σ_{j=1..k} n_j(w)

  G(w) = Σ_{j=1..k} p_j(w)^2

• The value of G(w) lies in the range (0, 1).
• Words with a higher value of gini-index are more discriminative for classification purposes.

Inverted lists
• Inverted list of keywords for each node
• Inverted list of nodes for each keyword (see the sketch below)
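A sketch of keyword selection and both inverted lists, assuming n_j(w) counts the labeled nodes of class j that contain word w; the function names are illustrative.

```python
from collections import Counter, defaultdict

def gini(class_counts):
    """G(w) = sum_j p_j(w)^2, with p_j(w) = n_j(w) / sum_j n_j(w)."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())

def build_keyword_index(node_text, labels, m):
    """Keep the m highest-gini words, then build the two inverted lists."""
    counts = defaultdict(Counter)              # word -> {class: n_j(w)}
    for node, words in node_text.items():
        if node in labels:
            for w in set(words):
                counts[w][labels[node]] += 1
    top = set(sorted(counts, key=lambda w: gini(counts[w]), reverse=True)[:m])
    words_of = {n: [w for w in set(ws) if w in top]  # keywords per node
                for n, ws in node_text.items()}
    nodes_of = defaultdict(list)               # nodes per keyword
    for n, ws in words_of.items():
        for w in ws:
            nodes_of[w].append(n)
    return top, words_of, nodes_of
```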
Analysis

Why do we care?
• DYCOS essentially uses Monte-Carlo sampling to sample various paths from each unlabeled node.
• Advantage: fast approach
• Disadvantage: loss of accuracy
• Can we present analysis on how accurate DYCOS sampling is?

Probabilistic bound: bi-class classification
• Two classes C1 and C2
• E[Pr[C1]] = f1, E[Pr[C2]] = f2, f1 − f2 = b ≥ 0
• Pr[mis-classification] ≤ exp{−l·b²/2}, where l is the number of sampled walks

Probabilistic bound: multi-class classification
• k classes {C1, C2, …, Ck}
• b-accurate (see the definition in the backup slides)
• Pr[b-accurate] ≥ 1 − (k−1)·exp{−l·b²/2} (a numeric check follows below)
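To make the bound concrete, a quick sanity check; the numbers are illustrative, not from the paper.

```python
import math

def accuracy_bound(l, b, k=2):
    """(k-1) * exp(-l * b**2 / 2): with k = 2 this is the bi-class
    mis-classification bound; one minus it lower-bounds Pr[b-accurate]."""
    return (k - 1) * math.exp(-l * b ** 2 / 2)

# Illustrative numbers: with l = 100 walks and a probability gap b = 0.3,
# the bi-class mis-classification bound is exp(-4.5), roughly 0.011.
print(accuracy_bound(100, 0.3))   # ~0.0111
```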
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Experimental Results

Data sets
• CORA: a set of research papers and the citation relations among them.
  • Each node is a paper and each edge is a citation relation.
  • A total of 12,313 English words are extracted from the paper titles.
  • We segment the data into 10 synthetic time periods.
• DBLP: a set of authors and their collaborations.
  • Each node is an author and each edge is a collaboration.
  • A total of 194 English words in the domain of computer science are used.
  • We segment the data into 36 annual graphs from 1975 to 2010.

Name   Node #    Edge #      Class #   Labeled Node #
CORA   19,396    75,021      5         14,814
DBLP   806,635   4,414,135   5         18,999
Experimental Results

NetKit-SRL toolkit
• An open-source, publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007).
• Instantiations of previous relational and collective classification algorithms
• Configuration
  • Local classifier: domain-specific class prior
  • Relational classifier: network-only multinomial Bayes classifier
  • Collective inference: relaxation labeling

Parameters
• 1) The number of most discriminative words, m; 2) the size constraint of the inverted list for each keyword, a; 3) the number of top content-hop neighbors, q; 4) the number of random walks, l; 5) the length of each random walk, h; 6) the structural parameter, ps (collected in the sketch below).

The results demonstrate that DYCOS improves classification accuracy over NetKit by 7.18% to 17.44%, while reducing the runtime to only 14.60% to 18.95% of NetKit's.
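For reference, one way to bundle the six parameters in code; the concrete values below are assumptions loosely echoing the settings shown in the sensitivity plots (a=30, ps=70%, l=3, h=5), not recommended defaults.

```python
# Hypothetical DYCOS parameter bundle; names follow the slide, values are
# assumptions echoing the sensitivity-plot settings, not tuned defaults.
DYCOS_PARAMS = {
    "m": 100,   # number of most discriminative words
    "a": 30,    # size constraint of each keyword's inverted list
    "q": 10,    # number of top content-hop neighbors
    "l": 3,     # number of random walks per unlabeled node
    "h": 5,     # length of each random walk
    "ps": 0.7,  # probability of a structural hop vs. a content multi-hop
}
```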
Experimental Results

[Figures: classification accuracy comparison and classification time comparison, DYCOS vs. NetKit on CORA.]
Experimental Results

[Figures: parameter sensitivity of DYCOS — sensitivity to m, l, and h (a=30, ps=70%) on CORA data; sensitivity to a, m, and ps (l=3, h=5) on DBLP data.]
Experimental Results

Dynamic Updating Time: CORA

Time Period         1      2      3      4      5      6      7      8      9      10
Update Time (Sec.)  0.019  0.013  0.015  0.013  0.023  0.015  0.014  0.014  0.013  0.011

Dynamic Updating Time: DBLP

Time Period         1975-1989   1990-1999   2000-2010
Update Time (Sec.)  0.03107     0.22671     1.00154
Outline

Related Work
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Conclusion

We propose an efficient, dynamic, and scalable method for node classification in dynamic networks.

We provide analysis of how accurate the proposed method is expected to be in practice.

We present experimental results on real data sets and show that our algorithms are more effective and efficient than competing algorithms.
Challenges

Information networks are very large
• Scalable and efficient

Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust

Such networks are often noisy
• Intelligent and selective

Heterogeneous correlations in such networks
• The correlations between the label of o and the content of o
• The correlations between the label of o and the contents and labels of objects in the neighborhood of o
• The correlations between the label of o and the unobserved labels of objects in the neighborhood of o
Analysis

Lemma. Consider two classes with expected visit probabilities f1 and f2 respectively, such that b = f1 − f2 > 0. Then the probability that the class visited most often during the sampled random walk process is reversed to class 2 is at most

  exp{−l·b²/2}

Definition. Consider the node classification problem with a total of k classes. We define the sampling process to be b-accurate if none of the classes whose expected visit probability is at least b below that of the class with the largest expected visit probability turns out to have the largest sampled visit probability.

Theorem. The probability that the sampling process results in a b-accurate reported majority class is at least

  1 − (k−1)·exp{−l·b²/2}
Experimental Results

[Figures: classification accuracy comparison and classification time comparison, DYCOS vs. NetKit on DBLP.]
Experimental Results

[Figures: additional parameter sensitivity plots — sensitivity to m, l, and h; to a, l, and h; to m, a, and ps; and to a, m, and ps.]