1 Department of Information Management, Wuhan University, Wuhan, China
2 National Science Library, Chinese Academy of Sciences, Beijing, China
(fyan@whu.edu.cn)
Abstract - Drawing on domestic and international work on co-word analysis and on the DLG association mining algorithm, this paper proposes a ternary co-word analysis algorithm based on bit vectors, together with a coordinate-map method for analyzing ternary co-word results. As an empirical study, the keywords of Chinese journal papers on knowledge discovery are analyzed. The study finds that the ternary co-word results are meaningful in practice and that ternary co-word analysis has practical value.
Keywords - co-word analysis, co-occurrence, knowledge discovery

I. INTRODUCTION
Co-word analysis is a content-based method of text analysis that was first described in the literature in the 1970s. Since then it has passed through three generations: the first was based on the inclusion index and the proximity index, the second on the strategic coordinates method, and the third took structural analysis of database content as its method [1]. Co-word analysis studies the co-occurrence distribution of keywords in the professional literature of a field [2]. It is used to analyze the themes and perspectives of research areas [3], to trace the development, changes and trends of a field [4], and to optimize information retrieval results [5].
So far, co-word studies have analyzed only the co-occurrence of two words (binary co-occurrence); the co-occurrence of multiple words has not yet been explored. This paper takes China's knowledge discovery research as a case study and examines a special case of multi-word co-occurrence, the ternary co-word. It focuses on the design of a ternary co-occurrence algorithm and a ternary co-word analysis method.

II. DATA SET
A. Selecting a Test Field
Knowledge discovery is the process of discovering new knowledge from large volumes of known data by some means or technology. The concept was first introduced in the field of data mining [6,7]; therefore, knowledge discovery usually refers specifically to Knowledge Discovery in Databases (KDD). Almost simultaneously, a knowledge discovery method based on non-related literatures emerged (Literature-Based Discovery, LBD) [8,9].
Knowledge discovery in databases aims to find reliable, useful, previously unknown and non-trivial rules and patterns that people can understand. Its core methods are data mining and data processing, and research in this area focuses on the design of data mining algorithms. Knowledge discovery among non-related literatures is based on relationships between scientific literatures that share neither co-citation nor co-occurrence.

B. Establishing and Standardizing the Data Set
The experiment assumes that knowledge is recorded in research papers published in Chinese core journals and takes these papers as the source data. The CNKI database was searched with the query {(theme = knowledge discovery) AND core journals}; the retrieval date was July 13, 2010, and the search returned 1764 results.
The 1764 results contain 6810 keywords, an average of 3.9 keywords per paper. Some of these keywords are variants of one another, such as synonyms (e.g., "data mining" and "data-mining"), English keywords (e.g., "Association rules"), English abbreviations (e.g., KDD, SVM, IDS), and other variant forms. A simple standardization was therefore applied, replacing the different forms of a keyword with a single unified term. After this processing, the top 20 high-frequency keywords are listed in Table I.

TABLE I
HIGH FREQUENCY KEYWORDS (TOP 20)
Keyword                  Frequency   Keyword                  Frequency
Knowledge Discovery      564         Genetic Algorithms       38
Data Mining              485         Clustering               36
KDD                      163         Decision Tree            35
Rough Sets               161         Machine learning         31
Association rules        123         Neural network           29
Intrusion Detection      99          Attribute Reduction      29
Data warehouse           52          Data mining              26
Knowledge Management     51          Concept lattice          25
Database                 42          Spatial Data Mining      25
Support Vector Machine   38          Non-related literature   25

III. CO-WORD ANALYSIS ALGORITHM
Co-word analysis has been developing for some 20 years. The result of a binary co-word analysis is usually presented as a two-dimensional table, the co-word matrix, which is the basis for analyzing the results. A co-word matrix cannot represent ternary co-occurrence, however, so the binary co-word algorithm cannot solve the ternary co-word problem.
The DLG (Direct Large Generation) algorithm [10] is an association rule mining algorithm. It scans the data set to build a bit vector for every item and then applies the AND operator (∧) to the bit vectors [10]. DLG is well suited to co-word computation. The steps are: 1) scan all the literature and collect the keywords, making sure that no keyword is repeated; 2) generate a bit vector for every keyword; 3) intersect the bit vectors to obtain the co-occurrence list of all keywords.
The procedure is illustrated with a literature set X = {A(a1, a2, a3, a5, a6), B(a2, a5, a6, a3, a4, a7), C(a1, a6, a3, a10), D(a6, a5, a10, a1, a11, a9, a8), E(a10, a7, a5, a3, a1)}, where A, B, C, D and E each denote a paper with its own keywords and a1, a2, ..., a11 each denote a keyword.
The first step collects all the keywords: {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11}. The second step generates the bit vectors {a1(1,0,1,1,1), a2(1,1,0,0,0), a3(1,1,1,0,1), a4(0,1,0,0,0), a5(1,1,0,1,1), a6(1,1,1,1,0), a7(0,1,0,0,1), a8(0,0,0,1,0), a9(0,0,0,1,0), a10(0,0,1,1,1), a11(0,0,0,1,0)}; for example, a1(1,0,1,1,1) means that keyword a1 occurs in papers A, C, D and E. The third step applies the AND operator: a1 ∧ a2 = (1, 0, 0, 0, 0), which means that keywords a1 and a2 appear together only in the first paper. The number of 1s in the resulting vector gives the co-occurrence frequency of a1 and a2. Repeating this operation yields the co-occurrence list of all keywords.
The bit-vector computation for binary co-word analysis can be summarized as

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i * a_j \qquad (1)$$

and the ternary co-word analysis algorithm as

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{k=j+1}^{n} a_i * a_j * a_k \qquad (2)$$

where a_i is the bit vector of keyword i, n is the number of keywords, and * is the AND operator on bit vectors. The formula can be extended in the same way to co-word analysis of more than three words.

IV. KNOWLEDGE DISCOVERY IN CO-WORD ANALYSIS
The data are the research papers retrieved from Chinese core journals with the search term "knowledge discovery". Because the strength of ternary co-occurrence is much lower than that of binary co-occurrence, only keywords with a frequency higher than 3 were used to compute ternary co-occurrence strength; this ensures that the phrases representing knowledge discovery in non-related literatures appear in the final results. The binary and ternary co-word lists were extracted with the bit-vector co-word algorithm and ranked by co-occurrence strength; the top 20 results are shown in Table II.
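The bit-vector procedure of Section III can be sketched in a few lines of Python. This is a minimal illustration on the worked literature set X, not the authors' implementation: the function names (`build_bit_vectors`, `as_tuple`, `co_occurrence`) are hypothetical, and storing each bit vector as a Python integer is an assumption made for brevity.

```python
from itertools import combinations

# Literature set X from the worked example: paper -> keywords.
X = {
    "A": ["a1", "a2", "a3", "a5", "a6"],
    "B": ["a2", "a5", "a6", "a3", "a4", "a7"],
    "C": ["a1", "a6", "a3", "a10"],
    "D": ["a6", "a5", "a10", "a1", "a11", "a9", "a8"],
    "E": ["a10", "a7", "a5", "a3", "a1"],
}


def build_bit_vectors(docs):
    """Scan the papers once; bit p of a keyword's vector is 1 if paper p contains it."""
    vectors = {}
    for position, keywords in enumerate(docs.values()):
        for keyword in set(keywords):  # ignore repeated keywords within one paper
            vectors[keyword] = vectors.get(keyword, 0) | (1 << position)
    return vectors


def as_tuple(vector, n_docs):
    """Render an integer bit vector in the paper's (A, B, C, D, E) notation."""
    return tuple((vector >> position) & 1 for position in range(n_docs))


def co_occurrence(vectors, order):
    """AND the bit vectors of every keyword combination of the given size."""
    strengths = {}
    for combo in combinations(sorted(vectors), order):
        joint = vectors[combo[0]]
        for keyword in combo[1:]:
            joint &= vectors[keyword]
        strength = bin(joint).count("1")  # number of papers containing all keywords
        if strength > 0:
            strengths[frozenset(combo)] = strength
    return strengths


vectors = build_bit_vectors(X)
binary = co_occurrence(vectors, 2)   # formula (1)
ternary = co_occurrence(vectors, 3)  # formula (2)

print(as_tuple(vectors["a1"], len(X)))                  # (1, 0, 1, 1, 1)
print(as_tuple(vectors["a1"] & vectors["a2"], len(X)))  # (1, 0, 0, 0, 0)
print(ternary[frozenset({"a1", "a3", "a5"})])           # 2
```

Passing a larger `order` to `co_occurrence` gives the multi-word extension mentioned in Section III, at the cost of enumerating every keyword combination of that size.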
TABLE II
KNOWLEDGE DISCOVERY RESEARCH TERNARY CO-WORD ANALYSIS RESULTS (THE FIRST 20, NON-RELATED LITERATURE)
Keyword
Binary co-word analysis
Keyword Co-occurrence strength
Data
Mining
Data
Association rules
72
KDD 60
Mining
Data
Mining
Data
Data warehouse
Rough Sets
36
33
Mining
Data
Mining
Database 24
Data
Mining
Association rules
Data
Mining
Data
Mining
Data
Mining
Ternary word analysis
Keyword Keyword Keyword
KDD
Frequent item sets
KDD
KDD
Association rules
Association rules
Incremental updating
Rough Sets
Data warehouse
Frequent item sets
Co-occurrence strength
7
7
6
6
6
KDD Association 22 Data Association Incremental 6
rules Mining rules updating
Intrusion
Detection
Data
Mining
Rough Sets
Intrusion
Detection
Data
Mining
Data
Mining
Association rules
Data
Mining
Data
Mining
Data
Mining
Data
Mining
Support
Vector
Machine
Intrusion
Detection
Attribute
Reduction
Network
Security
Decision
Tree
Machine learning
Frequent item sets
Clustering
E-commerce
Classification rules
Decision
Support
KDD KDD Rough Sets
Rough Sets Reduction
21
15
15
15
14
11
11
10
10
10
10
10
10
Data
Mining
Data
Mining
Frequent item sets
Association rules
Rough Sets Reduction
Data
Mining
Non-related literature
Genetic
Algorithms
Swanson
Swanson
9
5
Data
Mining
Data
Mining
Data
Mining
Intrusion
Detection
Web
Mining
Data
Mining
Data
Mining
Data
Mining
Data
Mining
Data
Mining
Data
Mining
Non-related literature
KDD
Association rules
Association rules
Network
Security
Information
Mining
Database
Frequent item sets
Algorithm design
Reasoning mechanism
KDD
Rough
Sets
Rough
Sets
Association rules
Intrusion
Detection
Data warehouse
Decision
Support
Neural network
Decision table
Concept lattice
Network
Security
Database
Arrowsmith
Incremental updating
Algorithm
Discernable matrix
Database
Swanson
6
5
5
4
4
4
4
4
3
3
3
3
3
3
3
As Table II shows, China's current research focuses on {Data Mining, KDD}, {Data Mining, Data Warehouse} and {Data Mining, Rough Sets}.
Data mining is the core technology for knowledge discovery in databases, and association rules are an important condition in data mining. The ternary analysis yields {Data Mining, KDD, Association Rules} and {Association Rules, Frequent Item Sets, Incremental Updating} as representative high-frequency ternary phrases. Compared with binary co-words, these triples reflect not only the closeness of research topics but also, more directly, the trend of the research area. Given this background, KDD research focuses mainly on mining items whose data are frequently updated.
The other important research area is knowledge discovery among non-related literatures. In the binary co-word analysis it ranks 50th, represented by the words "non-related literature" and "Swanson"; Professor Swanson proposed the non-related literature knowledge discovery method. In the ternary co-word analysis the corresponding topic ranks 16th, ahead of its position in the binary result. The most frequent phrase representing it is {Arrowsmith, Swanson, non-related literature}. Arrowsmith is the first prototype system for non-related literature knowledge discovery, developed by Professor Swanson, so the topic this phrase represents is the introduction of Swanson's knowledge discovery tools. The topics of this group of words indicate that non-related literature knowledge discovery in China is basically at the introduction stage. On the one hand, the method was introduced into China relatively late [11]; on the other hand, it belongs to the field of information science, and information science research and education in China also started relatively late.
A comparison of the binary and ternary results shows that the two methods do not reflect entirely the same things. Binary co-word analysis reflects the closeness of the research themes represented by the keywords, while ternary co-word analysis presents research trends. In the directions or topics they reveal, ternary co-word analysis is more specific than binary co-word analysis. In addition, ternary co-word analysis can yield results that binary co-word analysis cannot reveal.
V. COORDINATE ANALYSIS OF TERNARY WORDS
Binary co-word analysis has a variety of visual analysis methods, including dendrograms, social network graphs and strategic diagrams, but these are built on binary co-occurrence and are clearly not suitable for ternary co-word analysis. Visual analysis is nevertheless an intuitive and convenient method that contributes to a more complete and clear understanding of research findings. This experiment therefore uses two indices, stability and influence, to analyze the ternary co-word results visually in a coordinate diagram.
Stability (S_ijk) measures how stable the co-occurrence of a ternary phrase is. It is the average of the ratios of the co-occurrence frequency of the phrase to the frequency of each of its keywords:

$$S_{ijk} = \frac{1}{3}\left(\frac{C_{ijk}}{C_i} + \frac{C_{ijk}}{C_j} + \frac{C_{ijk}}{C_k}\right) = \frac{C_{ijk}\,(C_i C_j + C_j C_k + C_k C_i)}{3\,C_i C_j C_k}$$

where C_ijk is the co-occurrence frequency of the phrase and C_i, C_j, C_k are the frequencies of its three keywords.
Influence (I_ijk) measures the influence of a phrase on the analysis results. In the experiment it is the ratio of the sum of the frequencies of the three keywords in the phrase to the total over all ternary phrases:

$$I_{ijk} = \frac{CR_i + CR_j + CR_k}{TR}$$

where CR_i, CR_j, CR_k denote the frequencies of the three keywords i, j, k in a phrase and TR denotes the number of all ternary phrases.
The top 50 ternary phrases by co-occurrence frequency were plotted in a coordinate diagram, with stability on the vertical (Y) axis and influence on the horizontal (X) axis (Figure 1).
Fig. 1. Stability and influence of ternary phrases (top 50)
As shown in Figure 1, stability and influence tend to be inversely related. The most influential point is (0.550, 0.049), which represents the phrase {Data Mining, KDD, Rough Sets}; its stability is very low. The most stable point is (0.002, 0.833), which represents the phrase {Chinese Herbal Medicine, standard amount, Chinese Pharmacopoeia}; its influence is very low.
In addition, the top 50 phrases in the plot fall into six general ranges, with coordinates given as (influence, stability):
Range A: phrases with high stability but weak influence. The two points in this range include the phrase {Rough Sets, Ordered Information Systems, roughness}; the frequency of each keyword in the phrase is {161, phrase, 3}, the co-occurrence frequency is 3, and the coordinates are (0.125, 0.673).
Range B: phrases with relatively high stability and weaker influence. It contains the phrase representing literature-based knowledge discovery, {non-related literature, Arrowsmith, Swanson}; the keyword frequencies are {25, 6, 6} and the coordinates are (0.016, 0.373).
Range C: phrases weak in both stability and influence. An example is the phrase representing the use of knowledge discovery methods for network security, {intrusion detection, support vector machines, network security}; the keyword frequencies are {99, 38, 21} and the coordinates are (0.124, 0.084).
Range D: phrases high in both influence and stability. In fact, no phrase falls in this range.
Range E: phrases with relatively high stability and very high influence. An example is the phrase representing the use of data mining to provide personalized services, {Data Mining, KDD, personality analysis engine}; the keyword frequencies are {520, 163, 2} and the coordinates are (0.442, 0.339).
Range F: phrases with high influence but low stability. An example is the phrase representing mining algorithms for association rules, {data mining, rough sets, neural networks}; the keyword frequencies are {520, 161, 51} and the coordinates are (0.461, 0.042). Although the phrase has high influence, the stability of its co-occurrence is very poor.
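As a quick check of the two indices, the stability and influence formulas can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, the co-occurrence frequency of 3 used for the Range B and Range C phrases is an assumption (the text states it only for the Range A phrase), and the value of TR is a placeholder because the paper does not report it.

```python
def stability(c_ijk, c_i, c_j, c_k):
    """Average ratio of the phrase co-occurrence frequency to each keyword frequency."""
    return (c_ijk / c_i + c_ijk / c_j + c_ijk / c_k) / 3


def stability_closed_form(c_ijk, c_i, c_j, c_k):
    """Same index written over the common denominator 3 * C_i * C_j * C_k."""
    return c_ijk * (c_i * c_j + c_j * c_k + c_k * c_i) / (3 * c_i * c_j * c_k)


def influence(cr_i, cr_j, cr_k, tr):
    """Sum of the three keyword frequencies divided by TR."""
    return (cr_i + cr_j + cr_k) / tr


# Range B phrase {non-related literature, Arrowsmith, Swanson}: frequencies {25, 6, 6}.
# The co-occurrence frequency 3 is an assumption, not a value given in the text.
print(round(stability(3, 25, 6, 6), 3))    # 0.373

# Range C phrase {intrusion detection, support vector machines, network security}.
print(round(stability(3, 99, 38, 21), 3))  # 0.084

# TR is not reported in the paper; 1000 is a placeholder to show the call.
print(influence(25, 6, 6, tr=1000))
```

Under the assumed co-occurrence frequency, both stability values agree with the coordinates quoted for Ranges B and C.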
VI. CONCLUSION
This paper explores and proposes an algorithm for ternary co-word analysis. The experiment has some shortcomings: 1) during data normalization only the high-frequency keywords were standardized, not the low-frequency ones; because ternary co-occurrence strengths are low, this choice has a considerable effect on the accuracy of the final results; 2) methods for analyzing ternary co-word results need further research to become more useful; 3) the experiment explored only ternary co-word analysis; the method is also applicable to co-word analysis of more than three words, which remains to be explored.
REFERENCES
[1] Feng Lu, Leng Fuhai. Co-word analysis of methods of theoretical developments [J]. Library Science in China, 2006, 32(2): 88-92.
[2] Lee B, Jeong YI. Mapping Korea's national R&D domain of robot technology by using the co-word analysis. Scientometrics, 2008, 77(1): 3-19.
[3] JIANG Chun, Du Weibin, Lijiang Bo. Economics hot areas of knowledge mapping: co-word analysis perspective [J]. Journal of Information, 2008, 27(9): 78-80.
[4] Bredillet. Investigating the Future of Project Management: a co-word analysis approach. Proceedings of IRNOP VII Project Research Conference. 2006: 477-497.
[5] Rokaya M, Atlam E, Fuketa M, Dorji TC, Aoe JI. Ranking of field association terms using co-word analysis. Information Processing & Management, 2008, 44(2): 738-755.
[6] Liu Suqin, when read cloud, Hsu nine Yun, Ye Feiyue. Database knowledge discovery research [J]. Surface engineering of oil and gas fields, 2003, 22(4): 54-55.
[7] Kuang Ping sword. Data mining and knowledge discovery review [J]. Modern computers (Professional Edition), 2002, (6): 13-17.
[8] Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 1986, 30(1): 7-18.
[9] Fuhai Liu. Progress based on knowledge of the literature found that the application [J]. Intelligence Journal, 2006, 25(6): 700-712.
[10] Arbee L. P. Chen. An efficient approach to discovering knowledge from large databases. Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems. 1996: 8-18.
[11] Ma Ming, Wuyishan. The information science methodological significance of academic achievement and enlightenment [J]. Intelligence Journal, 2003, (3): 259-266.