A Method to Access Linguistic Information
Considering Hierarchical Structures of Languages
Shoji Mizobuchi
Faculty of Science and Engineering, Kinki University, Japan
mizo@info.kindai.ac.jp
Kazuaki Ando
Faculty of Engineering, Kagawa University, Japan
ando@eng.kagawa-u.ac.jp
Abstract: This paper proposes a method to access linguistic information about natural language
expressions appearing in documents on the Web. Natural language expressions are surface forms of
linguistic elements, such as characters, words, and phrases, which can be represented hierarchically.
The proposed method takes hierarchical structures of languages into consideration and provides a
way for the reader to select a specific natural language expression from a string and display
associated linguistic information. The proposed method simplifies the use of language resources
when unknown natural language expressions are encountered. The proposed method has been
compared with several conventional methods for input usability. The result of the comparison has
confirmed that the proposed method has characteristics that are superior to conventional methods.
Introduction
In recent years, the development of information and communication technology and the digitization of
conventional media have led many people to read documents on the Web frequently. As a result,
documents on the Web are increasingly used as educational materials in
reading and language learning (Hsu, 2010; Petersen, 2009). One problem that occurs when reading a document on
the Web is that a reader may encounter natural language expressions that have unknown linguistic features
(hereafter referred to as unknown expressions). Frequent appearance of unknown expressions in a document
hampers the readability of the document. For example, it increases the time spent reading, reduces the amount of
knowledge gained, and decreases motivation for reading. These adverse effects can affect all readers, but occur
more frequently for readers whose vocabularies are underdeveloped (e.g., younger or lower-grade students, L2
learners, and novices in a discipline).
A general method to find linguistic information about an unknown expression is to look the expression up
in a language resource, such as a dictionary or glossary (Fraster, 1998). However, this takes time and effort. A
number of methods have been proposed to make it easy to consult a language resource (Chun, 2001; Ercetin, 2003),
but in conventional methods, text from a Web page that is input as a search key is limited to specific types of natural
language expressions. The input text can consist of several types of linguistic elements, such as characters, words
and phrases, which can be represented hierarchically. To eliminate the limitations of the conventional methods, the
hierarchical structures of languages must be considered.
In this paper, we propose a method to access linguistic information about natural language expressions
appearing in documents on the Web that takes the hierarchical structures of languages into consideration. We also
compare our method with three types of conventional methods in terms of input usability. Our method simplifies the
consultation of language resources when unknown expressions are encountered and is expected to support reading
Web pages.
The remainder of the paper is organized as follows. In Section 2 we describe the consultation of natural
language resources, and in Section 3 we propose a method to access linguistic information. The results of a
comparison between our method and conventional methods are described in Section 4. Finally, Section 5 presents
conclusions and describes the direction of future work.
Figure 1: A Hierarchical Structure of Linguistic Elements
Consulting Natural Language Resources
When an unknown expression is encountered in a Web page, a natural language resource is generally
consulted to clarify the expression. In this paper, consulting a natural language resource means trying to obtain
linguistic information about an unknown expression from an electronic natural language resource, such as a
dictionary or a glossary. The first step in this procedure is the selection of a natural language resource. The time
taken to perform this operation and how often it needs to be performed vary across resources.
Once a resource has been selected, the reader performs the following steps.
(1) Input an unknown expression as a search key.
(2) Perform a search.
(3) Find appropriate linguistic information in the result.
The Proposed Method
We propose an improved method to access linguistic information about natural language expressions
appearing in documents on the Web. The proposed method takes the hierarchical structure of language into
consideration and allows a reader to specify any type of natural language expression as a search key quickly and
effortlessly.
Hierarchical Structures of Languages
Natural language expressions are surface forms of linguistic elements, such as characters, words, and
phrases. The elements that make up a text string can be represented hierarchically, as shown in (Fig. 1).
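The hierarchy described above can be sketched as a small tree in which a phrase contains words and each word contains characters. This is only an illustrative model, not the representation used in the paper: the `Element` class, the kind labels, and the whitespace-based word splitting are all assumptions (splitting on spaces would not work for languages written without spaces, such as Japanese).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """One linguistic element covering text[start:end] (hypothetical model)."""
    kind: str   # assumed labels: "phrase", "word", "character"
    start: int
    end: int
    children: List["Element"] = field(default_factory=list)


def build_hierarchy(phrase: str) -> Element:
    """Build phrase -> words -> characters, assuming single-space separators."""
    root = Element("phrase", 0, len(phrase))
    pos = 0
    for token in phrase.split(" "):
        word = Element("word", pos, pos + len(token))
        word.children = [
            Element("character", i, i + 1) for i in range(word.start, word.end)
        ]
        root.children.append(word)
        pos += len(token) + 1  # skip the separating space
    return root


def outline(text: str, node: Element, depth: int = 0) -> List[str]:
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + f"{node.kind}: {text[node.start:node.end]!r}"]
    for child in node.children:
        lines.extend(outline(text, child, depth + 1))
    return lines
```

Calling `outline(text, build_hierarchy(text))` on a short phrase lists the phrase first, then each word, then the characters of that word, which mirrors the levels of the hierarchy.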
The easiest way to identify a linguistic element is to select the string in which the corresponding natural
language expression appears. However, this can result in ambiguity because a number of linguistic elements become
candidates for the search key used when consulting a language resource. Our method introduces a graphical user
interface (GUI) component whose operating procedure visualizes this ambiguity and permits selection of a specific
linguistic element from the available candidates.
Figure 2: A Terrace with a Popup Balloon
GUI Component
In our method, the GUI component is called a terrace. A terrace displays the natural language
expressions associated with a position selected by the reader and allows the reader to select one of them and display
linguistic information about it. A terrace consists of overlapping rectangles, each of which indicates a linguistic element.
The rectangles differ in color and size, and each includes a button to open and close a popup balloon. When
a reader selects a rectangle in a terrace, linguistic information about the natural language expression in it is displayed
in a popup balloon. A terrace with a popup balloon is illustrated in (Fig. 2).
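One way to picture the overlapping rectangles is as layers drawn back to front, with the largest element at the back and each smaller element nested inside it. The sketch below is an assumption-laden illustration rather than the paper's layout: the `Rect` class, the palette, the fixed character width, and the per-layer margin are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical colors, one per layer (back to front).
PALETTE = ["#cfe8ff", "#ffe8a3", "#ffc9c9"]


@dataclass
class Rect:
    label: str
    x: int
    y: int
    width: int
    height: int
    color: str


def layout_terrace(text: str, spans: List[Tuple[int, int]],
                   char_w: int = 10, line_h: int = 20,
                   margin: int = 6) -> List[Rect]:
    """Return rectangles back to front, assuming nested spans on one line
    of fixed-width text. Larger elements get more padding so that every
    layer stays visible around the layer drawn on top of it."""
    ordered = sorted(spans, key=lambda s: s[1] - s[0], reverse=True)
    rects = []
    for layer, (start, end) in enumerate(ordered):
        pad = margin * (len(ordered) - 1 - layer)
        rects.append(Rect(
            label=text[start:end],
            x=start * char_w - pad,
            y=-pad,
            width=(end - start) * char_w + 2 * pad,
            height=line_h + 2 * pad,
            color=PALETTE[layer % len(PALETTE)],
        ))
    return rects
```

Because padding shrinks as the elements get smaller, each rectangle lies entirely inside the one behind it, which gives the stepped, terrace-like appearance.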
Operating Procedure
The operations performed from the time a reader encounters an unknown expression until the reader
obtains explanatory linguistic information are as follows.
(1) The reader selects an arbitrary point proximate to the unknown expression. Then, a terrace displays natural
language expressions in the vicinity of the selection.
(2) The reader checks whether the unknown expression appears in the terrace. (If the expression does not appear, the
procedure is over.)
(3) The reader selects the unknown expression in the terrace. Then, a popup balloon displays linguistic information
for the unknown expression.
This procedure is illustrated in (Fig. 3).
An advantage of our method is that it is possible to confirm the existence of the linguistic information in
the first step.
Implementation
Our method has been implemented as a Web application using HTML, CSS, JavaScript, and Java.
Application screenshots are shown in (Fig. 4) through (Fig. 6). The text displayed in all screenshots is the
beginning of “The Spider’s Thread” (Kumo no ito)[1], a short story by Ryunosuke Akutagawa. (Fig. 4)
shows the terrace that appears when a reader selects the character “S” in the phrase “The Spider’s Thread.” The
terrace displays three natural language expressions, “S,” “Spider,” and “The Spider’s Thread,” because they are
associated with the element that the reader has selected. (Fig. 5) shows the popup balloon that appears when the
word “Spider” is selected. Linguistic information for “Spider” is displayed in the balloon. (Fig. 6) shows the popup
balloon that appears when the reader clicks on the phrase “The Spider’s Thread.” Linguistic information for the
phrase is displayed in the balloon.
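The behavior shown in the screenshots can be approximated with a few lines of code. The sketch below is not the authors' implementation (which uses HTML, CSS, JavaScript, and Java); it is a minimal Python model in which the element spans, the function names, and the glosses in `RESOURCE` are hypothetical placeholders.

```python
from typing import Dict, List, Optional, Tuple

TEXT = "The Spider’s Thread"


def span_of(text: str, expression: str) -> Tuple[int, int]:
    """Return (start, end) of the first occurrence of expression in text."""
    start = text.index(expression)
    return (start, start + len(expression))


# Assumed linguistic elements of TEXT: one character, three words, one phrase.
SPANS: List[Tuple[int, int]] = [
    span_of(TEXT, "S"),
    span_of(TEXT, "The"),
    span_of(TEXT, "Spider"),
    span_of(TEXT, "Thread"),
    span_of(TEXT, TEXT),
]

# Hypothetical language resource; the glosses are placeholders.
RESOURCE: Dict[str, str] = {
    "Spider": "placeholder gloss for the word",
    TEXT: "placeholder gloss for the phrase",
}


def open_terrace(text: str, spans: List[Tuple[int, int]], pos: int) -> List[str]:
    """Step 1: list every expression covering pos, smallest first."""
    hits = [(s, e) for s, e in spans if s <= pos < e]
    hits.sort(key=lambda span: span[1] - span[0])
    return [text[s:e] for s, e in hits]


def open_balloon(resource: Dict[str, str], expression: str) -> Optional[str]:
    """Step 3: return the information shown in the popup balloon, if any."""
    return resource.get(expression)
```

Selecting the position of “S” with `open_terrace(TEXT, SPANS, TEXT.index("S"))` yields the three expressions named in the text, and `open_balloon(RESOURCE, "Spider")` returns the entry that would fill the balloon.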
Figure 3: Operation Procedure
[1] http://www.edogawa-u.ac.jp/~tmkelly/research_spider.html
Figure 4: Screenshot 1
Figure 5: Screenshot 2
Figure 6: Screenshot 3
Type \ Criteria     | (a) Consideration of Hierarchical Structures | (b) Number of Operations | (c) Timing of Notification of the Existence of Linguistic Information
Typing Method       | -                                            | -                        | -
Highlighting Method | -                                            | +                        | -
Pointing Method     | -                                            | +                        | +
Our Method          | +                                            | +                        | +
Table 1: Comparison of our Method to the Three Types of Conventional Methods
Evaluation
We defined three criteria and compared our method to three types of conventional methods. The types
of conventional methods and the comparison results based on the criteria are described below.
Types of Conventional Methods
The three types of conventional methods are typing, highlighting, and pointing using a mouse. Typing a
natural language expression as a search key is typically used in dictionary services. Highlighting is used in other
Web-based services, such as netLearn (Chun, 2001) and Google Dictionary[1]. Pointing methods are used in
translation services, such as “Rikai” [2] and “popjisyo” [3].
Comparison Result
The criteria used to compare our method with the conventional methods are: (a) consideration of
hierarchical structures of languages, (b) number of operations, and (c) timing of notification of the existence of
linguistic information. The result of the comparison is summarized in (Tab. 1). As shown in (Tab. 1), our method
was assessed positively for all criteria. Therefore, our method is considered to be an effective way to access
linguistic information when consulting language resources. The application of each criterion is described below.
(a) Consideration of Hierarchical Structures of Languages
For criterion (a), each method is rated “+” if it can deal with distinct types of natural language expressions
appearing within the selected string; a method is rated “-” if it cannot deal with distinct types. Conventional methods
obtain a natural language expression or a position from readers. In contrast, our method obtains both a natural
language expression and its type. When distinct types of natural language expressions appear within the selected
string, the specific type is required to narrow the search. Thus, only our method can deal with distinct types of
natural language expressions appearing within the selected string.
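The difference between searching by expression alone and searching by expression plus type can be illustrated with a toy lookup. The resource contents, the type labels, and the function names below are hypothetical and serve only to show why the type narrows the search.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical resource indexed by (expression, type).
RESOURCE: Dict[Tuple[str, str], str] = {
    ("a", "character"): "first letter of the Latin alphabet",
    ("a", "word"): "indefinite article",
}


def search_by_expression(expression: str) -> List[str]:
    """Expression only: every entry with that surface form matches."""
    return sorted(
        info for (expr, _kind), info in RESOURCE.items() if expr == expression
    )


def search_by_key(expression: str, kind: str) -> Optional[str]:
    """Expression plus type: at most one entry matches."""
    return RESOURCE.get((expression, kind))
```

With the expression alone, `search_by_expression("a")` returns both entries and the reader must pick one; with the type included, `search_by_key("a", "word")` returns only the entry for the word.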
(b) Number of Operations
For criterion (b), each method is rated “+” if the number of operations is small; “-” is assigned if the
number of operations is large. Because an input method editor (IME) must be used, the number of operations in the typing method is
comparatively large. In contrast, the number of operations in the other methods is constant and small. In the
highlighting method, the number of operations is at least four. A reader identifies the first character of a natural
language expression, sets the first character as the start point, identifies the last character of the expression, and sets
the last character as the end point. In the pointing method, the number of operations is one because a reader only has
to click on a natural language expression. In our method, the number of operations is three as described previously.
[1] https://chrome.google.com/webstore/detail/google-dictionary-by-goog/mgijmajocgfcbeboacabfgobmjgjcoja
[2] http://www.rikai.com/
[3] http://www.popjisyo.com
(c) Timing of Notification of the Existence of Linguistic Information
For criterion (c), each method is rated “+” if a reader is notified of the existence of linguistic information
about a natural language expression input as a search key before consulting a language resource; “-” is assigned if
the reader is notified after consulting a language resource. In the typing and highlighting methods, the existence of
linguistic information can be confirmed only after a reader performs a search. In the pointing method and our method,
linguistic information about natural language expressions in a Web page is available without an active search
function because this information is obtained when the reader accesses the Web page.
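The timing difference in criterion (c) can be sketched as follows. The code assumes a hypothetical `CountingResource` whose `search` method stands in for a lookup in a language resource; `preload` models fetching information once when the page is accessed, so that later existence checks need no further search. None of these names come from the paper.

```python
from typing import Dict, Iterable, Optional


class CountingResource:
    """Hypothetical language resource that counts how often it is searched."""

    def __init__(self, entries: Dict[str, str]) -> None:
        self._entries = dict(entries)
        self.searches = 0

    def search(self, key: str) -> Optional[str]:
        self.searches += 1
        return self._entries.get(key)


def preload(resource: CountingResource,
            expressions: Iterable[str]) -> Dict[str, str]:
    """At page access: look up every expression on the page once and keep
    only those for which the resource has information."""
    found: Dict[str, str] = {}
    for expression in expressions:
        info = resource.search(expression)
        if info is not None:
            found[expression] = info
    return found


def has_information(preloaded: Dict[str, str], expression: str) -> bool:
    """While reading: check existence locally, without a new search."""
    return expression in preloaded
```

After `preload` has run, `has_information` answers from the preloaded dictionary, so the reader learns whether information exists before any explicit search takes place.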
Conclusion
In this paper, we have proposed a method to access linguistic information about natural language
expressions appearing in documents on the Web. Our method takes the hierarchical structures of languages into
consideration. We have also confirmed that our method is superior to conventional methods in terms of input
usability. Our method simplifies consultation of natural language resources when unknown expressions are
encountered and is expected to support reading Web pages.
In the future, we will conduct an experiment to verify whether our method is effective for actual users. The
purpose of our method is to support readers; however, the effectiveness of the method has not been tested and
confirmed experimentally.
References
Chun, M. D. (2001). L2 Reading on the Web: Strategies for Accessing Information in Hypermedia, Computer Assisted Language
Learning, 14(5), 367-403.
Ercetin, G. (2003). Exploring ESL Learners’ Use of Hyper-media Reading Glosses, CALICO Journal, 20(2), 261-283.
Fraster, C. A. (1998). The Role of Consulting a Dictionary in Reading and Language Learning, Canadian Journal of Applied
Linguistics, 2(1-2), 73-89.
Hsu, C., Hwang, G., & Chang, C. (2010). Development of a reading material recommendation system based on a knowledge
engineering approach, Computers & Education, 55(1), 76-83.
Petersen, S. E., & Ostendorf, M. (2009). A machine learning approach to reading level assessment, Computer Speech & Language,
23(1), 89-106.
Acknowledgements
This work was supported by JSPS KAKENHI Grant Number 23700996.