ICI 10 - 10TH INTERNATIONAL CONFERENCE ON INFORMATION AT DELTA UNIVERSITY FOR SCIENCE AND TECHNOLOGY, GAMASA, EGYPT, 4 - 6 DECEMBER 2010

Automatic abstract preparation

Dr. Tünde Molnár Lengyel
Associate professor, Eszterházy Károly College, Department of Informatics, Hungary
email: mtunde@ektf.hu

Abstract

As rapidly multiplying scientific information condemns any researcher's effort to keep up with the latest developments in his or her own field to futility, the automation of content exploration gains increasing significance. Consequently, any research or project that helps to attract attention to, or point out, the essential information of a given scholarly text becomes indispensable. While, partly due to the intricate linguistic structures involved, the automation of this task has not been fully achieved in Hungary, the urgency of implementing such a project is not diminished. Several programs presently facilitate the identification of essential information in English-language texts using quantitative or qualitative principles, in addition to the help available to researchers working with German, French, Greek, Spanish, Italian, and Chinese texts. My research project, which aims to develop an automated Hungarian-language text extraction program, attempts to remedy this situation. In my presentation I examine the possibilities and limitations of automating research abstract preparation, and survey the procedures and technological implementations that facilitate the compilation of research abstracts in Hungarian.

Keywords: automatic abstract preparation; content analysis; quantitative programme; qualitative programme

1. Introduction

As a result of the ever-increasing availability of new information, the automation of documentary content exploration is steadily gaining importance. Since the virtually unprocessable amount of information makes keeping up with the latest professional developments difficult even in our own field, library science and information management, research projects that promote the selection and identification of appropriate information assume a vital significance. Automation, one possible solution to this quandary, can yield several benefits, including improved efficiency of information search and retrieval systems, newly developed Internet search engines, and Internet-based news monitoring systems.

While information retrieval is crucial, issues concerning information processing have to be addressed as well. Questions arise such as: what should be done after the respective information has been identified, and once the user has reached a given web page or obtained the electronic version of a needed article or book, should he or she be left alone or offered assistance? The thorough reading of a given text is rather time-consuming, and considering the ever-increasing number of publications, an instructor or researcher aiming to keep up with the latest developments in his or her field would be compelled to spend far more time reading these materials than is practically feasible.
The requirements of one's profession and the expectations of society also lead to an increase in the number of articles, as the evaluation and potential advancement of a college instructor or academic researcher in any discipline are based on the number of articles published and lectures given at scholarly conferences. Consequently, as the number of publications grows, the novelty content of articles declines and redundancies and overlaps increase. This problem is not discipline-specific: the ever-growing number of professional publications leads to a decline in novelty, or new-information capacity, and increases the potential for overlaps and redundancies.

2. Methods

The role of research abstracts therefore becomes more and more significant, as with their help the reader needs roughly a tenth of the time to obtain the relevant, or (theoretically) the most important, information compared with reading the full text. Furthermore, the compilation of research abstracts also helps to eliminate redundancies. While in my opinion the significance and importance of research abstracts are beyond dispute, no one can be expected to read through all the periodicals containing research abstracts and every respective excerpt. However, research abstracts allow one to gain more information relevant to one's field, and another important benefit is that reading these abstracts helps one decide whether or not a full reading of the given article is necessary.

So far we have examined the importance of research abstracts from a user's point of view, and even after a brief glance we can safely conclude that surveying them poses a significant challenge in most disciplines. When we look at the other side of the coin, the compilation of research abstracts, we have to contend with difficulties in that respect as well. The ever-proliferating publications make the work of abstract and documentation compilers more and more arduous, as traditional methods cannot yield a comprehensive or even approximate representation of the materials in the respective fields. A greater reliance on the computer in the work of the documentalist appears to be the only solution, and more and more procedures should be elaborated to facilitate the presentation of the most important elements of articles, or possibly books.

While leading scientific publications nowadays require authors to provide abstracts or summaries with their submitted articles, only very few articles are published in this format, and the abstract preparation capacity of documentation experts is limited by their other professional commitments. Consequently, any research or project that helps to attract attention to, or point out, the essential information of a given scholarly text becomes indispensable. While, partly due to the intricate linguistic structures involved, the automation of this task has not been fully achieved in Hungary, the urgency of implementing such a project is not diminished. "In light of the wide-spread applicability and popularity of abstracts, the need to automate this process is all the more justified." [1]
Several programs presently facilitate the identification of essential information in English-language texts using quantitative or qualitative principles (Supplement 1), in addition to the help available to researchers working with German(1), French(2), Greek(3), Spanish(4), Italian(5), and Chinese(6) texts. Consequently, my research project, which aims to develop an automated Hungarian-language text extraction program, attempts to remedy this situation. In the following sections I examine the possibilities and limitations of automating research abstract preparation, and survey the procedures and technological implementations that facilitate the compilation of research abstracts in Hungarian.

Content analysis

Several theoretical approaches have been put forth for content analysis, including the manual process developed in 1963 [2], yet despite the call in the 2001 edition of the Librarians' Reference Book [3] for the elaboration of automatic Hungarian-language extraction programs, this goal has not yet been fully realized. Let us look at some options for the automation of extract preparation.

Content extraction software is usually grouped according to whether it facilitates quantitative or qualitative content analysis. However, no clear distinction can be made between the two approaches: all software capable of performing qualitative content analysis also provides quantity-based reports, and the qualitative approach is primarily indicated by the provision of memoranda and annotations. My research leads me to conclude that most programs do not go beyond statistical reporting, which helps researchers draw conclusions or perform analyses, but very few applications facilitate content extraction. While my research effort belongs to the category of quantity-based content analysis, in addition to focusing on statistical data I aim to discern the essential components of particular texts.

The first step in the automation of abstract preparation is ascertaining how many times certain words (treated as separate units) appear in the text. The resulting data set is then ordered by frequency, and thus the statistical verbal rendering of the text is obtained.

Frequency analysis

It was Zipf who discovered a certain regularity in the distribution of the words and structures of a text. He examined Joyce's Ulysses and, having arranged the words of the novel according to occurrence, asserted "that the product of the cumulative occurrence figures and the inclusive frequency values is constant." [4]

In order to perform a frequency analysis of a text, the roots of its words (types), along with their different forms of occurrence (tokens), have to be identified, and the occurrences are then ranked according to frequency. The identification of roots is a painstakingly long effort calling for the involvement of the computer in the work process, and in the case of Hungarian texts it is the most difficult task.
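To make this first step concrete, the short sketch below counts word occurrences in a text and orders them by frequency. It is a minimal illustration only: the stem function stands in for a proper Hungarian morphological analyser (the root-identification problem discussed above) and simply lowercases the word, and the stop-word list is a hypothetical placeholder, not the one used in my program.

import re
from collections import Counter

# Hypothetical placeholder: a real system would use a Hungarian
# morphological analyser to reduce each word form to its root.
def stem(word: str) -> str:
    return word.lower()

# Hypothetical stop-word list ("forbidden" words carrying no content).
STOP_WORDS = {"a", "az", "és", "is", "hogy", "the", "and", "of"}

def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Return (root, count) pairs ordered by decreasing frequency."""
    words = re.findall(r"\w+", text, flags=re.UNICODE)
    roots = [stem(w) for w in words if stem(w) not in STOP_WORDS]
    return Counter(roots).most_common()

if __name__ == "__main__":
    sample = "A kivonat a szöveg lényegét adja vissza. A kivonat rövid."
    for root, count in word_frequencies(sample):
        print(root, count)

The output of such a count is the frequency list referred to below; everything beyond this step depends on how the significant expressions are selected from that list.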
One possible solution could be computer-assisted linguistics, which in Hungary began in 1960 with the introduction of mechanized translation. This first period was primarily characterized by the elaboration of the foundations of a mechanized translation algorithm from Russian to Hungarian. The second era (1967-71) is defined by the work of documentary linguistics experts, who elaborated a syntactic analysis method of their own. The third, lexicological phase (1972-78), responding to the needs of literary critics and philologists, included the development of software for language teaching and the compilation of quantitative-analysis-based frequency dictionaries focusing on colloquial and literary Hungarian. However, these research results were so closely associated with certain scholars that the disbanding of the Documentation Group in Budapest in 1972 all but eliminated these linguistics research efforts. The fourth stage began in 1979 with an attempt to make up for the research experience lost in the 1970s. As a result of the dynamic development of language processing systems throughout Europe, Hungarian researchers elaborated the MI language, culminating in the arrival of a Hungarian morphological analysis method. The appearance of personal computer software in the 1990s led to rapid developments. The elaboration of a spelling and grammar checking system that takes into account the characteristics of the Hungarian language was a significant achievement of this period. In this system the composition of words, that is, the connection between the root and the suffix, was described by an algorithm. Morphologic, the firm responsible for its development, became one of the leading companies in the computational linguistics field after Microsoft purchased its program. The newer versions of this software examine the context and can eliminate irrelevant interpretations. [5]

Today more and more Hungarian institutions are achieving a worldwide reputation for their computational linguistics efforts. For example, the ILP (Inductive Logic Programming) work carried out in the Artificial Intelligence Research Laboratory of the Hungarian Academy of Sciences and the University of Szeged played a pioneering role in the introduction of experimental linguistic applications. While the abovementioned results help in the identification of the roots of Hungarian words, they have not yet been applied in the field of library science.

Following the identification of roots, counting the words contained in the given text is necessary for the frequency analysis, a task which can be carried out with simple programming commands. In order to perform a frequency analysis or prepare an abstract or excerpt, the significant expressions contained in the text have to be identified.

Methods

Consequently, I propose two methods. The definition of significant words can be achieved by the use of Luhn's principle [6]; "while the textual statistics approach utilizing the Luhn method has provided the most reliable results until now," the definition of the significant words can also be done by the vocabulary or dictionary method.

1. Following Zipf's laws, the significant expressions constitute a given domain of the frequency list, which depends on the respective discipline. It is, however, true in all disciplines that these expressions constitute neither the beginning nor the end of the list.
The list of significant words is obtained when the Gauss curve, defined according to the empirical method characteristic of the given discipline, is projected onto the frequency distribution function. [7] As far as Hungarian texts are concerned, few disciplines have elaborated a frequency dictionary that would facilitate the construction of the Gauss curve. Although word frequency dictionaries have been developed in Hungary [8], these are not discipline-specific. Nevertheless, computational linguistics research and dictionary compilation efforts [9] promise solutions for the future.

2. According to Luhn's theorem, non-forbidden words occurring more than three times are considered significant. According to Luhn's notion, elaborated in 1951, the multiple occurrence of certain two- and three-word constructs can be helpful in the computer-assisted identification of terms carrying relevant information. Having omitted trivial expressions, the weighting of adjacent word pairs and three-word constructs helps to identify the relevant sections of the text; consequently, non-trivial expressions containing two or more units receive a higher weighting value than their counterparts appearing only once in the text. Having established the weighting process, one has to define which units are to be retrieved as relevant locations, either sentences or paragraphs. Subsequently the automated process starts, during which a numerical value is assigned to each chosen unit based on the weighting, and the sentences or paragraphs with the highest numerical values are retrieved as the result.

In my view the combination of the two approaches is the best answer. Accordingly, for texts covering an area in which word frequency dictionaries are available, those dictionaries should be used; in the absence of a word frequency dictionary, Luhn's method should be applied. The latter identifies and defines the significant expressions by analysing the frequency of the words in the given text.

It is also vital to determine the structure and components of the resulting extract. While I recommend the preservation of the most important sentences, this method raises several problems as well: the references in the selected sentences have to be eliminated in order to guarantee the intelligibility of the independent extract (if necessary, linguistic approaches should be used); it is important to consider the intratextual position of each sentence; and attention should be paid to the potential introductory expressions [10] with which the author aimed to stress a given sentence. The collection and compilation of such expressions into a dictionary is feasible if the examination of manual text extraction substantiates this hypothesis, which is why, in my view, manual extraction should first be examined, its main features assessed and its principal dynamics explored, and the results then applied to the automation.
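The following sketch illustrates a simplified Luhn-style extraction of the kind described above: words that are not stop words and occur more than three times are treated as significant, each sentence is scored by the number of significant words it contains, and the highest-scoring sentences are returned in their original order. The sentence splitter, stop-word list, and frequency threshold are illustrative assumptions, not the exact parameters of the proposed program.

import re
from collections import Counter

# Illustrative stop-word ("forbidden word") list only.
STOP_WORDS = {"a", "az", "és", "is", "hogy", "the", "and", "of", "in"}

def luhn_extract(text: str, num_sentences: int = 3) -> list[str]:
    """Return the highest-scoring sentences as a crude extract."""
    # Naive sentence splitting; real texts need proper segmentation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    # Words occurring more than three times (roots should be used for
    # Hungarian) and not on the stop-word list count as significant.
    words = [w.lower() for w in re.findall(r"\w+", text, flags=re.UNICODE)]
    counts = Counter(w for w in words if w not in STOP_WORDS)
    significant = {w for w, c in counts.items() if c > 3}

    # Score each sentence by the number of significant words it contains.
    def score(sentence: str) -> int:
        return sum(1 for w in re.findall(r"\w+", sentence.lower()) if w in significant)

    # Rank sentence indices by score, then keep the top ones in original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]

Returning the chosen sentences in their original order preserves as much of the source text's argumentative flow as a purely extractive method can; the problems of dangling references and intratextual position discussed above would still have to be handled afterwards.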
From the very beginning of the research process I was compelled to search for an answer to one of the most intriguing quandaries: whether rules and guidelines can be established for human, non-automated abstract preparation. In order to answer this question I tested the summary preparation ability of students majoring in information management or library and information science programs at various higher education institutions, of informatics experts, and of students majoring in Hungarian language and literature. The participants of this survey were asked to prepare abstracts of professional articles on various topics; on the basis of the results, the topic proved worth investigating further.

I chose scientific articles on different topics as the subject of the survey. One treatise followed a more interdisciplinary approach, while the other focused exclusively on library science. Moreover, for the purpose of analysing the correlation between the subject of the article and the knowledge of the sample group, participants without an informatics background were also involved in the survey. For the abstracts of the article focusing on library science I expected to discern a greater divergence between the outputs of the professional and lay sample groups than in the case of the interdisciplinary treatise. The group with a non-informatics background consisted of students majoring in Hungarian language and literature, enabling me to examine which plays the more important role in abstract preparation: professional knowledge or procedural skills. In order to examine whether the different skills and professional levels of the sample groups lead to varying outputs in the abstract preparation process, not only students but also experts represented the information science and library science field in the survey. Finally, in order to examine whether the type of higher education institution in which a student is enrolled affects the abstract preparation process, I analysed the composition of the student sample groups.

3. Results

My results are as follows. For both articles a strong correlation can be established between the abstracts prepared by the student sample groups; the correlation with the professionally prepared abstracts is lower, but the results are similar for articles on different subjects as well. A significant difference that would support a correlation between the subject of the given article and the proficiency of the sample group could not be discerned. Among the student sample groups, content knowledge and non-content-specific procedural proficiency in text condensation and in identifying the essence of texts do not result in significant differences. According to the results of the analysis we can conclude that significant differences can be discerned between the abstracts prepared by students and those prepared by professional experts, further supporting the hypothesis that the level of content knowledge affects the final result of the abstract preparation process. For students, content-specific knowledge does not lead to major differences: the abstracts of both articles prepared by the three student groups show great similarity to each other, while a major difference can be observed in relation to the abstracts prepared by professional experts. Students tend to prefer sentences located at the beginning of the texts, while professional experts tend to select the sentences of their abstracts from the whole textual domain. Moreover, the student markings are more homogeneous than the rankings prepared by the experts, who represent varying proficiency levels and differing professional backgrounds. The identification of the crucial sentences of the text does not depend on the proficiency level of the student groups: the abstracts prepared by student groups majoring in library science and informatics programs at colleges and universities show no significant divergence. The close positive connection is indicated by a correlation value above 0.9 between the weighted numbers of the given sentences in the respective abstracts.
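The paper does not spell out the exact computation behind these correlation figures, but the idea can be illustrated with a short sketch: each sentence of an article receives a weight equal to the number of participants in a group who included it in their abstract, and the agreement between two groups is then the Pearson correlation of their sentence-weight vectors. The data below are invented for illustration only.

from statistics import correlation  # Python 3.10+

# Invented example: weights of the same eight sentences, i.e. how many
# participants in each group selected each sentence for the abstract.
students_weights = [5, 1, 0, 4, 3, 0, 2, 1]
experts_weights  = [4, 0, 1, 5, 2, 1, 3, 0]

r = correlation(students_weights, experts_weights)
print(f"Pearson correlation of sentence weightings: {r:.2f}")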
One of my priorities was the examination of the efficiency of the abstract preparation program; consequently, I had to develop clear guidelines concerning the appropriate output, and I hoped that several similar results would provide me with a point of reference and comparison. The outcome, however, was that no two people prepare the same abstract.

4. Conclusion

In conclusion I would point out that the automation procedure presented in my article can only be used for discursive texts in which the author concentrates on one topic and uses objective statements with a consistent vocabulary, form, and structure, rather than writing in a literary language. Generally, we can conclude that the Luhn method is more effective for scientific statements, publications, and reports than it would be for a sophisticated literary work.

While automated abstract preparation systems have not yet been implemented in Hungary, the demand for such a tool appears to be increasing, as it can be utilised in numerous areas. Researchers and instructors can increase their familiarity with the relevant professional texts. The search engines of various web pages are continuously improved, as finding the right information is crucial. But what happens after locating the respective information, or when a user reaches a certain web page and gains access to a downloadable article or book? At this point the user is left to his or her own devices and is compelled to perform the time-consuming activity of reading through the given material. If we take into consideration the continuously increasing number of published scholarly articles and texts, it becomes obvious that a researcher, instructor, or expert is forced to allocate more time to keeping up with the latest results of his or her discipline than is humanly possible. This is, however, not a discipline-specific problem, as the ever-increasing number of publications leads to a declining novelty value along with more frequent overlaps and redundancies.

My program can provide much-needed help by enabling the researcher to peruse the content of a given article before deciding to read it thoroughly. Furthermore, the program helps in the identification of overlapping information as well. While most integrated library systems can store and provide extracts of documents, the limited capacity of library staff does not allow a full utilization of this feature. Consequently, the program can potentially be used in preserving senior theses and dissertations in electronic format, and in expanding their applicability and processing options.
Although more and more journals accessible online furnish extracts of their articles, there are still several publications which do not go beyond offering a table of contents from which the full text of the article can be reached. My program can contribute significantly to the improvement of this situation. Another potential use of the program concerns the publication activity of researchers and instructors, who are often required to prepare extracts or summaries of their own work. This, however, raises a significant theoretical dilemma, as there are several conflicting views regarding the propriety of an author preparing the summary or extract of his or her own work. Self-prepared extracts usually do not meet the criteria of the genre, as authors tend to explain rather than summarize their own works. Consequently, if authors prepared a separate extract of their work with the help of the program and then shaped and improved it, they would not only save time but could go a long way towards eliminating the errors of bias. Furthermore, due to the increasing popularity and availability of digitization projects, more and more documents are becoming available in electronic format, which is not only encouraging but appears to justify the need for the elaboration of an extraction program.

References

[1] Horváth T., Papp I., Könyvtárosok kézikönyve 2 [Librarians' Reference Book 2]. Budapest: Osiris, 2001, p. 146.
[2] Szalai S., Gépi kivonatkészítés [Mechanized extract preparation]. Budapest: Országos Műszaki Könyvtár és Dokumentációs Központ, 1963, pp. 25-42.
[3] Horváth T., Papp I., Könyvtárosok kézikönyve 2 [Librarians' Reference Book 2]. Budapest: Osiris, 2001, pp. 146-147.
[4] Horváth T., Papp I., Könyvtárosok kézikönyve 1 [Librarians' Reference Book 1]. Budapest: Osiris, 1999, p. 107.
[5] Prószéki G., Számítógépes nyelvészet [Computational Linguistics]. Budapest, 1989, pp. 489-492.
[6] Horváth T., Papp I., Könyvtárosok kézikönyve 2 [Librarians' Reference Book 2]. Budapest: Osiris, 2001, p. 146.
[7] Horváth T., Papp I., Könyvtárosok kézikönyve 1 [Librarians' Reference Book 1]. Budapest: Osiris, 1999, p. 56.
[8] Lengyelné Molnár T., "Gyakorisági szótárak. Magyarországi helyzetkép" [An assessment of the availability of word frequency dictionaries in Hungary], Könyvtári Figyelő, vol. 52, no. 1, pp. 45-58, 2006.
[9] Magyar Nemzeti Szövegtár [Hungarian National Text Collection]; Szószablya szótár [Szószablya dictionary]; frequency data of the Magyar Értelmező Kéziszótár [Hungarian Dictionary of Definitions].
[10] E.g. "Összefoglalva" ("in sum"), "A lényeg" ("the essence is"), "Ne feledjük" ("let us not forget").

Multimedia Data:

1) TACT software, http://www.chass.utoronto.ca/tact/index.html
2) TACT software, http://www.chass.utoronto.ca/tact/index.html; T-LAB software, http://www.tlab.it
3) TACT software, http://www.chass.utoronto.ca/tact/index.html
4) TEXTPACK homepage, http://www.gesis.org/en/software/textpack/index.htm; T-LAB homepage, http://www.tlab.it
5) T-LAB homepage, http://www.tlab.it
6) CONCORDANCE homepage, http://www.concordancesoftware.co.uk/