Soonho Kim
(Soonho.kim@fao.org)
The multilingual semantic search assistant (MSSA) was designed as a plug-in for any search engine powered by Apache Lucene (Lucene 2008). Its implementation therefore modifies the query string itself rather than the target search engine. As illustrated in Figure 1, users formulate their query in the MSSA. When they press the “submit” button, the system takes them to the target search engine, which shows the results of the query. No code in the target search engine needs to change; the only requirement is to add a link to the MSSA on the target search engine homepage.
Figure 1. Communication between Lucene-based search engine and the multilingual semantic search assistant system associated with the AGROVOC Web Service.
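The hand-off described above, in which the MSSA only builds a query string and sends the browser to the target search engine, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the base URL `https://search.example.org/results` and the parameter name `q` are hypothetical placeholders, since the text does not give the actual address or parameter names of the target search engine.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: the real search engine URL and the name of its
# query parameter are not given in the text.
TARGET_SEARCH_URL = "https://search.example.org/results"
QUERY_PARAM = "q"


def build_redirect_url(query, base_url=TARGET_SEARCH_URL, param=QUERY_PARAM):
    """Return the URL the browser is sent to when the user presses "submit".

    The query is percent-encoded so that characters such as "+" and spaces
    survive the trip through the URL unchanged.
    """
    return f"{base_url}?{urlencode({param: query})}"


if __name__ == "__main__":
    print(build_redirect_url("+biosecurity +plant -animal"))
```

Because only the URL changes, the target search engine needs no modification, which matches the plug-in design described in the text.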
The MSSA system consists of four independent functions based on the requirements: 1) the “build complex search” function, 2) the “browse topics” function, 3) the “expand languages” function, and 4) the “expand synonyms” function, in the five official FAO languages.
The “build complex search” function should be intuitive and easy to use. A Venn diagram consists of two or more intersecting circles representing relationships among given sets, and it is frequently used to show graphically the results of combining basic Boolean operators. However, most Boolean search interfaces rely on text input or on drop-down boxes containing Boolean operators. For example, the advanced search in Google supports text-based Boolean operators, as shown in Figure 2 (left). The MSSA instead employs the Venn diagram-based user interface illustrated in Figure 2 (right). This makes users more comfortable with Boolean operators because they do not have to specify them explicitly. Instead, clicking the area of interest automatically creates a keyword query with the corresponding Boolean operators.
Figure 2. A screenshot of the Venn diagram-based Boolean search interface in the MSSA (right) and the Google advanced search interface (left).
2.1.1. How to use the “build complex search” function
Four text input boxes sit below a large text input box, as shown in Figure 3.
Figure 3. A screenshot of the Venn diagram-based Boolean search interface in the MSSA.
Users can type any keywords they are interested in, as shown in Figure 4. Suppose that a user wants to find documents about “biosecurity for plant”. The user starts by typing the keyword “biosecurity” in the first text box, labeled A. The MSSA then automatically draws the corresponding circle, also labeled A, as a filled blue circle.
Figure 4. A screenshot of the Venn diagram-based Boolean search interface after typing “biosecurity”.
In the same way, the user can create two more circles (a green circle and a yellow circle) by typing the keywords “plant” and “animal”, because he or she is interested only in biosecurity for plants, not animals. The next step is to specify which area of the Venn diagram is of interest. For example, in Figure 5 the black area of the Venn diagram is the intersection of the three keywords.
Figure 5. A screenshot of the Venn diagram-based Boolean search interface after typing “biosecurity”, “plant” and “animal”.
In Figure 6, the user selects the blue area, which covers the intersection of “biosecurity” and “plant” but excludes “animal”. This means that he or she is interested in documents containing the two keywords “biosecurity” and “plant” and does not want documents containing the keyword “animal”. The user can then press the “submit” button to call the AGRIS/CARIS search engine with the formulated query (biosecurity +plant - animal).
Figure 6. A screenshot of the Venn diagram-based Boolean search interface with “biosecurity”, “plant” and “animal” typed in to find documents relevant to “biosecurity for plant”.
The AGRIS/CARIS search engine then searches for documents containing both “biosecurity” and “plant”, removes any document containing the keyword “animal”, and shows the resulting set of documents to the user.
Figure 7. A screenshot of the AGRIS/CARIS search engine showing the results for the query formulated automatically by the MSSA.
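The translation from a clicked Venn region to a Boolean query can be sketched as follows. This is an illustrative sketch, not the MSSA source code: the function name `venn_region_to_query` is hypothetical, and the output uses the `+` (required) and `-` (prohibited) prefixes of the classic Lucene query syntax, which may differ in detail from the exact string the MSSA emits. It handles a single selected region only.

```python
def venn_region_to_query(keywords, inside):
    """Build a Boolean query for one region of a Venn diagram.

    keywords: the keywords typed into the text boxes, one per circle.
    inside:   the keywords whose circles contain the clicked region.

    Circles that contain the region become required terms ("+term");
    all other circles become prohibited terms ("-term").
    """
    keywords = [k.strip() for k in keywords if k.strip()]  # skip empty boxes
    required = [f"+{k}" for k in keywords if k in inside]
    excluded = [f"-{k}" for k in keywords if k not in inside]
    if not required:
        raise ValueError("the selected region must lie inside at least one circle")
    return " ".join(required + excluded)


if __name__ == "__main__":
    query = venn_region_to_query(
        ["biosecurity", "plant", "animal"],
        inside={"biosecurity", "plant"},
    )
    print(query)  # +biosecurity +plant -animal
```

Selecting all three circles as `inside` gives the query for the central intersection, the black area in Figure 5.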
The “browse topics” function lets users browse all terms in the AGROVOC thesaurus. Browsing was originally designed around alphabetical order. This poses no problem for Latin-based languages such as English, French and Spanish, which are browsed by the Latin alphabet, or for Arabic. Ordering by Chinese characters is difficult, however, because the Chinese language consists of over 47,035 characters, and displaying that many characters would not be an efficient way to browse AGROVOC terms. As a more efficient alternative, the “browse topics” function employs a mapping from the Latin alphabet to Chinese characters. The mapping is based on the sound of each Chinese character and was collected from a Chinese input system 1 .
1 http://www.inputking.com/EN/index.php
2.2.1. How to use the “browse topics” function
Figure 8. A screenshot of the front page of the “browse topics” function in the MSSA. The default language is English.
As shown in Figure 8, this function lists the five official languages of FAO: Arabic, Spanish, English, French, and Chinese. The next line shows the alphabet of the selected language. In Figure 8 the selected (default) language is English, so the 26 letters of the English alphabet from A to Z are shown. Users can select a language by clicking one of the five radio buttons or by selecting a language at the top right of the page. In Figure 9, Arabic letters are shown after Arabic is selected.
Figure 9. Arabic-alphabetical characters in this function.
Figure 10. A screenshot showing the Chinese AGROVOC terms that correspond to the mapping between the Latin letter “A” and Chinese characters.
As shown in Figure 10, browsing in Chinese differs from browsing in the other languages, which show their own alphabets: it shows Latin letters instead of Chinese characters. As mentioned above, users select a Latin letter based on the sound of the first character of a Chinese term. For example, the term “仰口线虫属” is listed under “A” because of the sound of its first character, “仰”. Thus, users who know the sound of the first character of a term can easily browse Chinese terms in the AGROVOC thesaurus.
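The sound-based browsing index can be sketched as a lookup from the first character of each term to a Latin letter. The table below is a tiny hand-made stand-in for the mapping collected from the Chinese input system; it is an assumption for illustration only. The entry for “仰” follows the example in the text, and the terms “北京” and “中国” are sample strings, not necessarily AGROVOC terms.

```python
from collections import defaultdict

# Hypothetical, hand-made stand-in for the sound table (first character ->
# Latin letter). The real mapping came from a Chinese input system.
FIRST_CHAR_TO_LETTER = {
    "埃": "A",
    "仰": "A",  # follows the example given in the text
    "北": "B",
    "中": "Z",
}


def group_terms_by_letter(terms, table=FIRST_CHAR_TO_LETTER):
    """Group terms under the Latin letter mapped to their first character."""
    index = defaultdict(list)
    for term in terms:
        if not term:
            continue  # ignore empty strings
        # "#" collects terms whose first character is not in the table.
        letter = table.get(term[0], "#")
        index[letter].append(term)
    return dict(index)


if __name__ == "__main__":
    sample_terms = ["埃及", "仰口线虫属", "北京", "中国"]
    index = group_terms_by_letter(sample_terms)
    print(index["A"])  # the terms listed when the user clicks "A"
```

With such an index, the browse page needs to display only the 26 Latin letters rather than tens of thousands of characters.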
Figure 11. The “browse topics” function showing a detailed description of the selected term “埃及”.
As shown in Figure 11, users can retrieve more detailed information about a term by clicking it. For example, in Figure 11 the Chinese term “埃及” is described with its broader term, related terms, and used-for terms. To search for the selected term in the AGRIS/CARIS search engine, users just click the “search AGRIS” button. The function then passes the selected term as a query to the AGRIS/CARIS search engine and takes users to the search results, as shown in Figure 12.
Figure 12. A search result for the selected query “埃及” in the AGRIS/CARIS search engine.
The MSSA applies a thesaurus-based query translation approach in which keywords typed by the user are translated into selected target languages using the AGROVOC thesaurus. While machine translation is limited in how well it preserves the sense of the original query, AGROVOC thesaurus-based query translation is straightforward because every translation has already been verified by domain experts. A disadvantage of this approach is that AGROVOC cannot cover every domain. However, this limitation might be overcome by including other thesauri covering different domains.
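Thesaurus-based query translation amounts to a lookup rather than a translation step, which the following sketch illustrates. The in-memory dictionary `SAMPLE_THESAURUS` and the function `translate_keyword` are hypothetical; the MSSA obtains its labels from the AGROVOC Web Service, whose interface is not described in the text and is not modeled here.

```python
# Hand-made sample entry standing in for expert-verified thesaurus labels.
# A real deployment would fetch these from the AGROVOC Web Service instead.
SAMPLE_THESAURUS = {
    "water": {"en": "water", "fr": "eau", "es": "agua", "zh": "水", "ar": "ماء"},
}


def translate_keyword(keyword, target_languages, thesaurus=SAMPLE_THESAURUS):
    """Return {language: label} for the languages the user selected.

    An empty dict means the thesaurus does not cover the keyword, which is
    the coverage limitation noted in the text.
    """
    labels = thesaurus.get(keyword.strip().lower())
    if labels is None:
        return {}
    return {lang: labels[lang] for lang in target_languages if lang in labels}


if __name__ == "__main__":
    print(translate_keyword("water", ["fr", "es", "zh"]))
    print(translate_keyword("blockchain", ["fr"]))  # not covered
```

Adding a second thesaurus for another domain would only require consulting an additional dictionary when the first lookup returns nothing.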
The cross-language support functionality of the MSSA, called “expand languages”, is illustrated in Figure 13.
Because AGROVOC contains 35,000 terms per language, it is important that the function shows available terms to users based on their interest before performing translation and query expansion. Thus, the function provides context-sensitive terms whenever users type a character, as shown in Figure 13 (left). When the user selects a concept, the MSSA automatically calls the AGROVOC WebService to obtain translations into the five official languages. Users can then add or delete languages using the checkboxes shown in Figure 13 (right).
Figure 13. Cross-language query support in the MSSA, called “expand languages”. The left picture shows the concepts available for the user input, and the right one illustrates the interface of the cross-language query support.
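The context-sensitive suggestions shown as the user types can be sketched as a prefix search over a sorted term list. This is one possible approach, assumed here for illustration; the text does not say how the MSSA implements it. The function name `suggest_terms` and the sample term list are hypothetical.

```python
from bisect import bisect_left


def suggest_terms(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`.

    `sorted_terms` must be lowercase and sorted, so a binary search finds
    the first candidate without scanning all terms of a language.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) >= limit:
            break  # sorted order: no later term can match
        matches.append(term)
    return matches


if __name__ == "__main__":
    # Illustrative sample only; not an extract of the thesaurus.
    terms = sorted(["water balance", "water budget", "water saturation",
                    "watersheds", "wheat"])
    print(suggest_terms(terms, "wat"))
    print(suggest_terms(terms, "water b"))
```

Calling the function again after each keystroke narrows the list, which is the behavior shown in Figure 13 (left).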
Domain-specific knowledge is the most important resource for query expansion. AGROVOC can again play a role, since it provides a variety of synonyms and acronyms and is already officially approved in the AGROVOC communities. The domain-specific knowledge discussed in section 2.1 for synonym expansion was therefore implemented using the AGROVOC WebService. For example, the term “water balance” selected by a user is expanded by adding the AGROVOC synonyms “water budget”, “water saturation” and “evaporate demand”, as shown in Figure 5.
Figure 5. Synonym expansion using the AGROVOC thesaurus.
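Synonym expansion can be sketched as rewriting one term into an OR-query over the term and its synonyms. The synonym list below is copied from the example in the text; the dictionary `SAMPLE_SYNONYMS` and the function `expand_with_synonyms` are hypothetical, and the quoted-phrase `OR` form is an assumed output format in classic Lucene query syntax rather than the confirmed MSSA output.

```python
# Synonyms taken from the example in the text; in the MSSA they are obtained
# from the AGROVOC WebService rather than from a hard-coded dictionary.
SAMPLE_SYNONYMS = {
    "water balance": ["water budget", "water saturation", "evaporate demand"],
}


def expand_with_synonyms(term, synonyms=SAMPLE_SYNONYMS):
    """Return an OR-query matching the term or any of its synonyms.

    Each variant is quoted so multi-word terms are searched as phrases.
    A term without known synonyms is returned as a single quoted phrase.
    """
    variants = [term] + synonyms.get(term, [])
    return " OR ".join(f'"{v}"' for v in variants)


if __name__ == "__main__":
    print(expand_with_synonyms("water balance"))
```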