
Study of Ternary Co-words Analysis in China’s Knowledge Discovery Area

Fei Yan 1, Lin Wang 2

1 Department of Information Management, Wuhan University, Wuhan, China

2 National Science Library, Chinese Academy of Sciences, Beijing, China

(fyan@whu.edu.cn)

Abstract - Drawing on domestic and international work on co-word analysis and on the DLG association mining algorithm, this paper proposes a ternary co-word analysis algorithm based on bit vectors, together with a coordinate-map method for analyzing ternary co-word results. As an empirical study, the keywords of Chinese journal papers on knowledge discovery are analyzed. The study finds that the ternary co-word results are meaningful in practice and that ternary co-word analysis has practical value.

Keywords - co-word analysis, co-occurrence, knowledge discovery

I. INTRODUCTION

Co-word analysis is a content-based method of text analysis that was first described in the literature in the 1970s. Since then it has gone through three generations: the first is based on the inclusion index and the proximity index, the second on the strategic coordinates (strategic diagram) method, and the third on analysis of the content structure of databases [1]. Co-word analysis is used to analyze the themes and perspectives of research areas [3] by studying the co-occurrence distribution structure of keywords in the professional literature of a field [2], to trace the development, changes and trends of a field [4], and to optimize information retrieval results [5].

So far, co-word analysis studies have dealt only with binary (pairwise) word co-occurrence; the co-occurrence of more than two words has not yet been examined. This paper takes China's knowledge discovery research area as a case study and examines a special case of multi-word co-occurrence, namely ternary co-word co-occurrence. It focuses on the design of a ternary co-occurrence algorithm and a ternary co-word analysis method.

II. DATA SET

A. Select a Test Field

Knowledge discovery is the process of discovering new knowledge from large volumes of known data by some means or technology (Knowledge Discovery in Database, KDD). The concept was first introduced in the field of data mining [6,7], so knowledge discovery usually refers specifically to knowledge discovery in databases. Almost simultaneously, a knowledge discovery method based on non-related literatures emerged (Literature-Based Discovery, LBD) [8,9].

Knowledge discovery in databases aims to find reliable, useful, previously unknown and non-trivial rules and patterns that people can understand. Its core methods are data mining and data processing, and research in this area focuses on the design of data mining algorithms. Knowledge discovery among non-related literatures is based on relationships between scientific literatures that neither co-cite nor co-occur.

B. The establishment of data sets and specifications

The experiment takes research papers published in Chinese Core Journals as the source data. The CNKI database was searched with the query {(theme = knowledge discovery) AND core journals}; the retrieval date was July 13, 2010, and the search returned 1764 results.

The 1764 results contain 6810 keywords, an average of 3.9 keywords per paper. These keywords appear in various forms, including synonyms (e.g., data mining and data-mining), English keywords (e.g., Association rules) and English abbreviations (e.g., KDD, SVM, IDS). A simple standardization was therefore applied: the different forms of a keyword were replaced with one unified term. After this processing, the 20 most frequent keywords are ranked in Table I.

TABLE I
HIGH FREQUENCY KEYWORDS (TOP 20)

Keyword                   Frequency    Keyword                   Frequency
Knowledge Discovery       564          Genetic Algorithms        38
Data Mining               485          Clustering                36
KDD                       163          Decision Tree             35
Rough Sets                161          Machine learning          31
Association rules         123          Neural network            29
Intrusion Detection       99           Attribute Reduction       29
Data warehouse            52           Data mining               26
Knowledge Management      51           Concept lattice           25
Database                  42           Spatial Data Mining       25
Support Vector Machine    38           Non-related literature    25

III. CO-WORD ANALYSIS ALGORITHM

Co-word analysis has gone through 20 years of development. The results of binary co-word analysis are generally presented as a two-dimensional table that forms the co-word matrix, which is the basis of co-word analysis. A co-word matrix, however, clearly cannot represent ternary co-occurrence, so the binary co-word algorithm does not solve the ternary co-word problem.

The DLG algorithm (Direct Large Generation) [10] is an association rule mining algorithm. It scans the data set to build a bit vector for every item and an association graph of all items, and then applies the bit-vector orthogonal operator [10]. DLG is well suited to co-word computation. The steps are: 1) scan all the literature and collect the keywords, making sure none is repeated; 2) generate a bit vector for every keyword; 3) compute the co-occurrence list of all keywords.

The specific process is as follows. Take a literature set X = {A(a1,a2,a3,a5,a6), B(a2,a5,a6,a3,a4,a7), C(a1,a6,a3,a10), D(a6,a5,a10,a1,a11,a9,a8), E(a10,a7,a5,a3,a1)}, where A, B, C, D, E each represent a specific paper with its own keywords and a1, a2, ..., a11 each represent a keyword.

The first step yields the keyword set {a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11}. The second step generates the bit vectors {a1(1,0,1,1,1), a2(1,1,0,0,0), a3(1,1,1,0,1), a4(0,1,0,0,0), a5(1,1,0,1,1), a6(1,1,1,1,0), a7(0,1,0,0,1), a8(0,0,0,1,0), a9(0,0,0,1,0), a10(0,0,1,1,1), a11(0,0,0,1,0)}, where a1(1,0,1,1,1) means that keyword a1 occurs in papers A, C, D and E. The third step applies the orthogonal operator, an element-wise product of bit vectors: a1 * a2 = (1, 0, 0, 0, 0). This means that keywords a1 and a2 appear together only in the first paper, and the modulus of the vector (here, its number of 1s) represents the co-occurrence frequency of a1 and a2. Repeating this for all keywords finally gives the co-occurrence list.

The bit-vector calculation for binary co-word analysis can be summarized by the following formula:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i * a_j \tag{1}$$

and the ternary co-word analysis algorithm is described as:

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \sum_{k=j+1}^{n} a_i * a_j * a_k \tag{2}$$

The formula can also be extended to co-word analysis of more than three words.

IV. KNOWLEDGE DISCOVERY IN CO-WORD ANALYSIS

The data are the research papers retrieved from the Chinese Core Journals with the search term "knowledge discovery". Because the strength of ternary co-occurrence is much lower than that of binary co-occurrence, keywords with a frequency higher than 3 were chosen for computing ternary co-occurrence strength; this ensures that phrases representing knowledge discovery in non-related literatures appear in the final results. The binary and ternary co-word lists were produced by the bit-vector co-word algorithm described above and ranked by co-occurrence strength. The top 20 results are shown in Table II.
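The three steps of the bit-vector procedure in Section III (collect keywords, build bit vectors, combine them with the orthogonal operator) can be sketched in Python on the example literature set X. This is a minimal illustration, not the authors' implementation: the function names `build_bit_vectors` and `co_occurrence` are hypothetical, the orthogonal operator is assumed to be an element-wise AND (as the worked a1, a2 example suggests), and co-occurrence strength is taken to be the number of 1s left in the combined vector.

```python
from itertools import combinations

# Example literature set X from Section III: paper -> its keywords.
papers = {
    "A": ["a1", "a2", "a3", "a5", "a6"],
    "B": ["a2", "a5", "a6", "a3", "a4", "a7"],
    "C": ["a1", "a6", "a3", "a10"],
    "D": ["a6", "a5", "a10", "a1", "a11", "a9", "a8"],
    "E": ["a10", "a7", "a5", "a3", "a1"],
}


def build_bit_vectors(papers):
    """Steps 1-2: collect the distinct keywords and give each a 0/1 vector
    with one position per paper (1 = the keyword occurs in that paper)."""
    order = list(papers)  # fixed paper order: A, B, C, D, E
    keywords = sorted({kw for kws in papers.values() for kw in kws})
    return {kw: tuple(int(kw in papers[p]) for p in order) for kw in keywords}


def co_occurrence(vectors, size):
    """Step 3 (assumed element-wise AND): combine the bit vectors of every
    keyword combination of the given size and count the remaining 1s."""
    result = {}
    for combo in combinations(vectors, size):
        shared = tuple(min(bits) for bits in zip(*(vectors[kw] for kw in combo)))
        strength = sum(shared)
        if strength > 0:  # keep only combinations that actually co-occur
            result[frozenset(combo)] = strength
    return result


vectors = build_bit_vectors(papers)
binary = co_occurrence(vectors, 2)   # formula (1): all keyword pairs
ternary = co_occurrence(vectors, 3)  # formula (2): all keyword triples

print(vectors["a1"])                             # (1, 0, 1, 1, 1)
print(binary[frozenset({"a1", "a2"})])           # 1
print(ternary[frozenset({"a1", "a3", "a6"})])    # 2
```

Passing a larger `size` gives the multi-word extension mentioned after formula (2); the number of combinations grows quickly, which is one reason to restrict the input to higher-frequency keywords first.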

TABLE II
KNOWLEDGE DISCOVERY RESEARCH TERNARY CO-WORD ANALYSIS RESULTS (THE FIRST 20 AND NON-RELATED LITERATURE)

Binary co-word analysis                          Ternary co-word analysis
Keyword       Keyword             Strength       Keyword             Keyword               Keyword                 Strength
Data Mining   Association rules   72             Data Mining         KDD                   Association rules       7
Data Mining   KDD                 60             Association rules   Frequent item sets    Incremental updating    7
Data Mining   Data warehouse      36             Data Mining         KDD                   Rough Sets              6
Data Mining   Rough Sets          33             Data Mining         KDD                   Data warehouse          6
Data Mining   Database            24             Data Mining         Association rules     Frequent item sets      6
KDD           Association rules   22             Data Mining         Association rules     Incremental updating    6

Intrusion

Detection

Data

Mining

Rough Sets

Intrusion

Detection

Data

Mining

Data

Mining

Association rules

Data

Mining

Data

Mining

Data

Mining

Data

Mining

Support

Vector

Machine

Intrusion

Detection

Attribute

Reduction

Network

Security

Decision

Tree

Machine learning

Frequent item sets

Clustering

E-commerce

Classification rules

Decision

Support

KDD KDD Rough Sets

Rough Sets Reduction

21

15

15

15

14

11

11

10

10

10

10

10

10

Data

Mining

Data

Mining

Frequent item sets

Association rules

Rough Sets Reduction

Data

Mining

Non-related literature

Genetic

Algorithms

Swanson

Swanson

9

5

As Table 2 Shown, china’s current research is focused on

Data

Mining

Data

Mining

Data

Mining

Intrusion

Detection

Web

Mining

Data

Mining

Data

Mining

Data

Mining

Data

Mining

Data

Mining

Data

Mining

Non-related literature

KDD

Association rules

Association rules

Network

Security

Information

Mining

Database

Frequent item sets

Algorithm design

Reasoning mechanism

KDD

Rough

Sets

Rough

Sets

Association rules

Intrusion

Detection

Data warehouse

Decision

Support

Neural network

Decision table

Concept lattice

Network

Security

Database

Arrowsmith

Incremental updating

Algorithm

Discernable matrix

Database

Swanson

6

5

5

4

4

4

4

4

3

3

3

3

3

3

3

As Table II shows, in the binary co-word analysis China's current research is focused on {Data Mining, KDD}, {data mining, data warehouse} and {data mining, rough sets}. Data mining is the core technology of knowledge discovery in databases, and association rules are an important element of data mining. The ternary analysis yields {data mining, KDD, association rules} and {association rules, frequent item sets, incremental updating} as representative high-frequency ternary phrases. Compared with the binary pairs, these triples not only reflect the closeness of the research topics but also reflect the trend of the research area more directly. Taken together with the background, KDD research mainly focuses on mining items whose data are frequently updated.

The other important research direction is knowledge discovery among non-related literatures. In the binary co-word analysis it ranks 50th, represented by the words "non-related literature" and "Swanson"; Swanson is the professor who proposed the non-related literature knowledge discovery method. In the ternary co-word analysis, non-related literature knowledge discovery ranks 16th, ahead of its position in the binary results. The most frequent phrase representing it is {Arrowsmith, Swanson, non-related literature}. Arrowsmith is the first prototype system for non-related literature knowledge discovery, developed by Professor Swanson, so the topic this phrase represents is the introduction of Professor Swanson's discovery tools. This group of words indicates that non-related literature knowledge discovery in China is basically at the introduction stage. On the one hand, the method was introduced into China relatively late [11]; on the other hand, it belongs to the field of information science, where China's research and education also started relatively late.

Comparing the results of the binary and the ternary co-word analysis shows that the two methods do not reflect entirely the same things. Binary co-word analysis reflects the closeness of the research themes represented by the keywords, whereas ternary co-word analysis presents the trend of research. In terms of the directions or topics represented, ternary co-word analysis is more specific than binary co-word analysis. In addition, ternary co-word analysis can reveal results that binary co-word analysis cannot.

V. COORDINATE ANALYSIS OF TERNARY WORDS

Binary co-word analysis has a variety of visual analysis methods, including dendrograms, social network graphs and strategic diagrams, but these methods are based on binary co-occurrence and are clearly not suitable for ternary co-word analysis. Visual analysis, however, is easy to understand, convenient and direct, and contributes to a more complete and clearer understanding of the research findings. This experiment therefore uses two indexes, stability and influence, and plots them in a coordinate diagram to visualize the ternary co-word results.

Stability ($S_{ijk}$) refers to the stability of the co-occurrence of a ternary phrase. It is the average of the ratios of the phrase's co-occurrence frequency to the frequency of each of its keywords. The formula is

$$S_{ijk} = \frac{1}{3}\left(\frac{C_{ijk}}{C_i} + \frac{C_{ijk}}{C_j} + \frac{C_{ijk}}{C_k}\right) = \frac{C_{ijk}\,(C_i C_j + C_j C_k + C_k C_i)}{3\,C_i C_j C_k},$$

where $C_{ijk}$ is the co-occurrence frequency of keywords i, j, k and $C_i$, $C_j$, $C_k$ are the frequencies of the individual keywords.

Influence ($I_{ijk}$) refers to the influence of a phrase on the analysis results. In the experiment it is the ratio of the sum of the frequencies of the three keywords in the phrase to the number of all ternary phrases. The formula is

$$I_{ijk} = \frac{CR_i + CR_j + CR_k}{TR},$$

where $CR_i$, $CR_j$, $CR_k$ denote the frequencies of the three keywords i, j, k in a phrase and $TR$ denotes the number of all ternary phrases.
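The two indexes can be computed directly from the definitions above. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, the co-occurrence frequency of 3 used for the phrase with keyword frequencies {25, 6, 6} is an assumed value (the paper does not state it for that phrase), and the value passed as `tr` is a made-up placeholder, since the total number of ternary phrases is not reported.

```python
def stability(c_ijk, c_i, c_j, c_k):
    """Average ratio of the phrase's co-occurrence frequency c_ijk
    to the frequency of each of its three keywords."""
    return (c_ijk / c_i + c_ijk / c_j + c_ijk / c_k) / 3


def influence(cr_i, cr_j, cr_k, tr):
    """Sum of the three keyword frequencies divided by tr,
    the number of all ternary phrases."""
    return (cr_i + cr_j + cr_k) / tr


# Keyword frequencies {25, 6, 6}; co-occurrence frequency 3 is an assumption.
print(round(stability(3, 25, 6, 6), 3))   # 0.373

# tr=1000 is a hypothetical placeholder, not a value from the paper.
print(influence(25, 6, 6, tr=1000))       # 0.037
```

Because every ratio in the stability formula is at most 1, stability always lies between 0 and 1, and a single rare keyword in a phrase pushes it up sharply.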

Fig. 1. The stability and influence of ternary phrases (top 50)

The top 50 ternary phrases by co-occurrence frequency were selected and plotted in a coordinate diagram, with stability on the vertical (Y) axis and influence on the horizontal (X) axis (Figure 1).

As shown in Figure 1, stability and influence are inversely related. The most influential point is (0.550, 0.049), which represents the phrase {Data Mining, KDD, rough set}; as Figure 1 shows, the stability of this phrase is very low. The most stable point is (0.002, 0.833), which represents the phrase {Chinese Herbal Medicine, the standard amount, Chinese Pharmacopoeia}, but the influence of this phrase is very low.

In addition, the plot shows that the top 50 phrases fall into six general ranges.

Phrases in Range A have a high degree of stability but weak influence. Two points lie in this range. One example is the phrase {Rough Sets, Ordered Information Systems, roughness}: the frequencies of its keywords are {161, 3, 3}, the co-occurrence frequency is 3, and the coordinates of influence and stability are (0.125, 0.673).

Phrases in Range B have relatively high stability and weaker influence. Range B contains the phrase that represents literature-based knowledge discovery, {non-related literature, Arrowsmith, Swanson}: the frequencies of its keywords are {25, 6, 6}, and the coordinates of influence and stability are (0.016, 0.373).

Phrases in Range C are weak in both stability and influence. An example is the phrase that represents the use of knowledge discovery methods to address network security, {intrusion detection, support vector machines, network security}: the frequencies of its keywords are {99, 38, 21}, and the coordinates of influence and stability are (0.124, 0.084).

Range D would hold phrases that are high in both influence and stability, but in fact no phrase falls in this range.

Phrases in Range E have relatively high stability and very high influence. An example is the phrase that represents the use of data mining to provide personalized services, {Data Mining, KDD, personality analysis engine}: the frequencies of its keywords are {520, 163, 2}, and the coordinates of influence and stability are (0.442, 0.339).

Phrases in Range F have high influence but low stability. An example is {data mining, rough sets, neural networks}: the frequencies of its keywords are {520, 161, 51}, and the coordinates of influence and stability are (0.461, 0.042). Although the phrase has high influence, the stability of its co-occurrence is very poor.

VI. CONCLUSION

This paper explores and proposes a ternary co-word analysis algorithm. The experiment, however, has some shortcomings: 1) during data normalization, only the high-frequency keywords were standardized and unified, not the low-frequency ones; because ternary co-occurrence strength is low, this choice has a larger impact on the accuracy of the final results; 2) the methods for analyzing ternary co-word results need further research to become more useful; 3) the experiment explored only ternary co-word analysis; the method used is also applicable to multi-word co-word analysis, which has yet to be explored.

REFERENCES

[1] Feng Lu, Leng Fuhai. Co-word analysis of methods of theoretical developments [J]. Library Science in China, 2006, 32 (2): 88 - 92.

[2] Lee B, Jeong YI. Mapping Korea's national R&D domain of robot technology by using the co-word analysis. Scientometrics, 2008, 77 (1): 3 - 19.

[3] JIANG Chun, Du Weibin, Lijiang Bo. Economics hot areas of knowledge mapping: co-word analysis perspective [J]. Journal of Information, 2008, 27 (9): 78 - 80.

[4] Bredillet. Investigating the Future of Project Management: a co-word analysis approach. Proceedings of IRNOP VII Project Research Conference. 2006: 477 - 497.

[5] Rokaya M, Atlam E, Fuketa M, Dorji TC, Aoe JI. Ranking of field association terms using Co-word analysis. Information Processing & Management, 2008, 44 (2): 738 - 755.

[6] Liu Suqin, when read cloud, Hsu nine Yun, Ye Feiyue. Database knowledge discovery research [J]. Surface engineering of oil and gas fields, 2003, 22 (4): 54 - 55.

[7] Kuang Ping sword. Data mining and knowledge discovery review [J]. Modern computers (Professional Edition), 2002, (6): 13 - 17.

[8] Swanson DR. Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge. Perspectives in Biology and Medicine, 1986, 30 (1): 7 - 18.

[9] Fuhai Liu. Progress based on knowledge of the literature found that the application [J]. Intelligence Journal, 2006, 25 (6): 700 - 712.

[10] Arbee L. P. Chen. An efficient approach to discovering knowledge from large databases. Proceedings of the fourth international conference on Parallel and distributed information systems. 1996: 8 - 18.

[11] Ma Ming, Wuyishan. The information science methodological significance of academic achievement and enlightenment [J]. Intelligence Journal, 2003, (3): 259 - 266.
