Classification and clustering methods development and implementation for unstructured document collections

Osipova Nataly
St. Petersburg State University
Faculty of Applied Mathematics and Control Processes
Department of Programming Technology

Contents: Introduction, Methods description, Information Retrieval System, Experiments

Introduction
Contextual Document Clustering was developed in a joint project of the Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, and the Northern Ireland Knowledge Engineering Laboratory (NIKEL), University of Ulster.

Definitions
- Document
- Terms dictionary
- Dictionary
- Cluster
- Word context
- Context and document conditional probability distributions
- Entropy

Document conditional probability distribution
For a document x of size m (the total number of word occurrences in x), each word y has a frequency tf(y) and a conditional probability p(y|x) = tf(y)/m:

  y        tf(y)   p(y|x)
  word1    5       5/m
  word2    10      10/m
  word3    6       6/m
  ...
  wordn    16      16/m

The vector (5/m, 10/m, 6/m, ..., 16/m) is the document conditional probability distribution.

Word context
The context of a word w is built from all documents x1, x2, ..., xk that contain w: the frequencies of every word y in these documents are summed and normalized by their combined size m, giving p(y|w) = tf_w(y)/m. For example, if word1 occurs 5, 7 and 20 times in the documents containing w:

  y        tf(y)              p(y|w)
  word1    5 + 7 + 20 = 32    32/m
  word2    10                 10/m
  word3    12                 12/m
  ...

The resulting vector p(.|w) is the context conditional probability distribution of w.

Contents: Introduction, Methods description, Information Retrieval System, Experiments

Methods
- document clustering method
- dictionary build methods
- document classification method using a training set
Information retrieval methods:
- keyword search method
- cluster based search method
- similar documents search method

Contextual Document Clustering
Documents -> Dictionary -> Narrow context words -> Distances calculation -> Clusters

Entropy
For a context conditional probability distribution p_1, p_2, ..., p_n of a word y, with p_1 + p_2 + ... + p_n = 1, the entropy is

  H(p_1, ..., p_n) = - \sum_{i=1}^{n} p_i \log p_i.

Entropy is an uncertainty measure; here it is used to characterize the commonness (narrowness) of the word context.

Contextual Document Clustering
The entropy is maximal for the uniform distribution, max H(y) = H(1/n, ..., 1/n) = \log n; words whose context entropy is low are the narrow context words.

Entropy of a two-point distribution
  H([p_1, p_2]) = -(p_1 \log p_1 + p_2 \log p_2),  p_1 + p_2 = 1,
  0 \le H([p_1, p_2]) \le \log 2,
with the maximum \log 2 attained at p_1 = p_2 = 0.5.

Word Context - Document Distance
Let p_1 be the context conditional probability distribution of a word y and p_2 the conditional probability distribution of a document x; their average distribution is (p_1 + p_2)/2. The distance between the word context and the document is the Jensen-Shannon divergence

  JS_{1/2, 1/2}[p_1, p_2] = H((p_1 + p_2)/2) - (1/2) H(p_1) - (1/2) H(p_2),

which satisfies

  JS_{1/2, 1/2}[p_1, p_2] \ge 0,
  JS_{1/2, 1/2}[p_1, p_2] = 0  if and only if  p_1 = p_2.

(A computational sketch of this distance is given at the end of this part.)

Dictionary construction
Why:
- large volumes: 60,000 documents, 50,000 words => 15,000 words in a context
- importance of narrow context words
How: delete words with
1. high or low frequency
2. high or low document frequency
3. both 1 and 2
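The following Python sketch is not part of the original system (which was implemented in C# with stored procedures on MS SQL Server); it only illustrates, with assumed data structures (documents as token lists) and illustrative function names, how the document distribution, the word context distribution and the Jensen-Shannon distance defined above can be computed.

```python
import math
from collections import Counter

def distribution(tokens):
    """Conditional probability distribution p(y|x) of a document given as a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {y: tf / total for y, tf in counts.items()}

def word_context(word, documents):
    """Context distribution p(y|w): merge the counts of all documents containing `word`."""
    merged = Counter()
    for doc in documents:
        if word in doc:
            merged.update(doc)
    total = sum(merged.values())
    return {y: tf / total for y, tf in merged.items()}

def entropy(p):
    """H(p) = -sum_i p_i log p_i (zero probabilities contribute nothing)."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)

def js_divergence(p1, p2):
    """JS_{1/2,1/2}[p1, p2] = H((p1 + p2)/2) - 0.5*H(p1) - 0.5*H(p2)."""
    keys = set(p1) | set(p2)
    avg = {k: 0.5 * p1.get(k, 0.0) + 0.5 * p2.get(k, 0.0) for k in keys}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

# Illustrative usage: distance between the context of "law" and one document.
docs = [["family", "law", "court"], ["law", "contract"], ["water", "industry"]]
print(js_divergence(word_context("law", docs), distribution(docs[2])))
```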
Retrieval algorithms
- keyword search method
- cluster based search method
- search by example (similar documents) method

Keyword search method
Every document is represented by the set of words it contains:

  Document 1   Document 2   Document 3   Document 4
  word 1       word 10      word 15      word 11
  word 2       word 25      word 2       word 21
  word 3       word 30      word 32      word 3
  ...          ...          ...          ...
  word n1      word n2      word n3      word n4

Request: word 2. Result set: document 1, document 3.

Cluster based search method
Every cluster is represented by its context words:

  Cluster 1   Cluster 2   Cluster 3
  word 1      word 12     word 1
  word 2      word 26     word 23
  ...         ...         ...
  word n1     word n2     word n3

Request: word 1. Result set: Cluster 1 and Cluster 3, together with their documents.

Similar documents search
Within a cluster a minimal spanning tree is built over its documents (cluster name, document 1, ..., document 7); the documents similar to a given one are its neighbours in the tree.
Request: document 3. Result set: document 6, document 7.

Document classification: method 1
- The test documents are clustered.
- Topic contexts are built from the training set and the list of topics.
- Distances between the topic contexts and the cluster contexts are calculated.
- Classification result: cluster 1 - topic 10, cluster 2 - topic 3, ..., cluster n - topic 30.

Document classification: method 2
- The test documents and the training set are merged into one document set, which is clustered.
- The clusters are then mapped to topics from the topics list.
- Classification result: cluster 1 - topic 10, cluster 2 - topic 3, ..., cluster n - topic 30.

Contents: Introduction, Methods description, Information Retrieval System, Experiments

Information Retrieval System
- Architecture
- Features
- Use

IRS architecture
Database, server, client:
- database server: MS SQL Server 2000
- local area network
- "thick" client written in C#

DBMS MS SQL Server 2000:
- high performance
- scalable
- secure
- handles large volumes of data
- T-SQL stored procedures

IRS features
The following problems are solved in the IRS:
- document clustering
- keyword search method
- cluster based search method
- similar documents search method
- document classification with the use of a training set

DB structure
The database of the IRS consists of the following tables:
- documents
- all-words dictionary
- dictionary
- relations between documents and words (document-word)
- words contexts
- words with narrow contexts
- clusters
- intermediate tables used to build the main tables and to implement retrieval

Algorithms implementation
Documents -> all-words dictionary -> dictionary -> table "document-word" -> words contexts -> words with narrow contexts -> clusters and their centroids. The keyword search, cluster based search and similar documents search are built on top of these tables.

Similar documents search
[Figure: a weighted graph of the documents in a cluster, with pairwise distances on the edges (values such as 0.26967, 0.211, 0.57231), and the minimal spanning tree built over it (cluster name, documents 1-5).]
The implementation uses the clusters table, the distances table and the tree table.

IRS use
[Screenshots of the IRS user interface.]

Contents: Introduction, Methods description, Information Retrieval System, Experiments

Experiments
Test goals were:
- algorithm accuracy testing
- comparison of the different classification methods
- algorithm efficiency evaluation

Experiments
- 60,000 documents
- 100 topics
- training set volume = 5% of the collection size
- dictionary thresholds: 5 <= tf(y) <= 1000, 2 <= df(y) <= 1000

Result analysis
- Russian Information Retrieval Evaluation Seminar
- Macro-averaged recall, precision and F-measure were calculated.
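As a companion to the evaluation measures just mentioned, here is a small Python sketch of macro-averaged precision, recall and F-measure for single-label classification; it is not the original evaluation code, and the label lists in the usage example are hypothetical.

```python
from collections import defaultdict

def macro_scores(true_topics, predicted_topics):
    """Macro-averaged precision, recall and F-measure over all topics."""
    tp = defaultdict(int)   # documents correctly assigned to the topic
    fp = defaultdict(int)   # documents assigned to the topic by mistake
    fn = defaultdict(int)   # documents of the topic assigned elsewhere
    for true, pred in zip(true_topics, predicted_topics):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1
            fn[true] += 1
    topics = set(tp) | set(fp) | set(fn)
    if not topics:
        return 0.0, 0.0, 0.0
    precisions, recalls, fmeasures = [], [], []
    for t in topics:
        p = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        r = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        fmeasures.append(f)
    n = len(topics)
    return sum(precisions) / n, sum(recalls) / n, sum(fmeasures) / n

# Illustrative usage with hypothetical topic labels.
true = ["family law", "catering", "catering", "water industry"]
pred = ["family law", "catering", "water industry", "water industry"]
print(macro_scores(true, pred))
```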
Recall, Precision, F-measure
[Three bar charts comparing the participating systems ("Systems" on the horizontal axis): macro-averaged recall (vertical axis 0 to 0.6), precision (0 to 0.7) and F-measure (0 to 0.35); the presented system is labelled "textan", the other systems are anonymized.]

Result analysis
Some of the topics the test documents were classified into:

  №   Category
  1    Family law
  2    Inheritance law
  3    Water industry
  4    Catering
  5    Consumer services for the population
  6    Truck rental
  7    International space law
  8    Territory in international law
  9    Participants in foreign economic relations
  10   Foreign economic transactions
  11   Free economic zones; customs unions

Result analysis
Recall results for every category, in percent. The best results for each category were highlighted in bold in the original table.

  System \ Category     1     2     3     4     5     6     7     8     9     10    11
  textan                33    34    35    60    46    26    27    98    75    25    100
  xxxx                  1     0     0.2   3     4     0     0.9   0     3     0     2
  xxxx                  0     0     4.3   2.3   0     5     0.9   8     3     0     0.8
  xxxx                  55    86    75    19    59    51    80    0     41    82    0
  xxxx                  21    39    2     22    15    6     0     1.4   0     5     0
  xxxx                  40    43    16    11    25    23    10    1.4   1.2   5     0
  xxxx                  23    4     2.5   1.1   18    7     0.9   0     1.2   10    0
  xxxx                  2.7   0     0     0     1.5   0     0     0     0     0     0
  xxxx                  2.2   0     0     0     1.5   0     0     0     0     0     0
  xxxx                  37    21    12    22    18    27    51    0     0     0     0

Thank you for your attention!