: Supporting Multilexical Queries in SQL A. K J R. H

advertisement
LexEQUAL: Supporting Multilexical Queries in SQL
A. K UMARAN
1. Introduction
Current database systems offer support for storing multilingual data [2], but are not capable of querying across
languages, an important consideration in today’s global
economy. We therefore propose a new multilexical operator called LexEQUAL, that extends the standard lexicographic matching in database systems to matching of text
data across languages, specifically for names, which form
close to twenty percent of text corpora.
Our implementation of the LexEQUAL operator
is based on transforming matches in language space
into parametrized approximate matches in the equivalent phoneme space. A detailed evaluation of our approach
on a real data set shows that there exist settings of the algorithm parameters with which it is possible to achieve both
good recall and precision.
2. The LexEQUAL Operator
Catalog of Multilingual Books.com
Consider a hypothetical Books.com that sells books
in different languages, with a sample product catalog as shown above, and a user query to retrieve all the
works of an author, in a set of specified languages. This
can be easily achieved using LexEQUAL as shown below:
SELECT * FROM Books
WHERE Author LexEQUAL ‘Nehru’ Threshold 0.3
English, Hindi, Tamil, Arabic IN
The output for this query will contain all records of the
above Books.com table, for whom the Author attribute
contains names that are phonemically close to "Nehru",
with the Threshold parameter in the query determining the
match quality tradeoff between precision and recall.
Department of Computer Science and Automation, Indian Institute of
Science, Bangalore, INDIA. kumaran, haritsa@csa.iisc.ernet.in
Proceedings of the 20th International Conference on Data Engineering (ICDE’04)
1063-6382/04 $ 20.00 © 2004 IEEE
JAYANT R. H ARITSA
LexEQUAL ( , , )
Input: Strings , , Error Threshold, Languages with TTP transformations, 1. Language of ; Language of ;
2. if and then
3.
transform( , ); transform( , );
( ? : );
4.
5.
if editdistance )
6.
then return TRUE else return FALSE;
7. else return NORESOURCE;
The pseudocode of our implementation of LexEQUAL
is given above. The algorithm transforms the input multilingual strings to their equivalent phoneme strings and flags a
match if the edit distance between them is less than the userspecified error limit, Threshold. The transform function uses standard Text-to-Phoneme (TTP) converters that
convert to a canonical International Phonetic Association’s
phonemic alphabet. Further, the LexEQUAL implementation is parameterized for different cost functions of editdistance function and for domain-specific clustering of similar phonemes.
We have currently implemented LexEQUAL as a
user-defined function, incorporating Q-Grams [1] to filter out non-matches inexpensively and Phonemic Indexes [4] to narrow the search to potential matches
using standard database index structures. Our initial results on a commercial database system hold out the promise
that an inside-the-server implementation of multilingual matching will have runtimes comparable to traditional
monolingual matching.
In summary, the LexEQUAL operator employing phonetic matching can complement the standard lexicographic
operators, representing a first step towards achieving complete multilingual functionality in database systems. The
full version of this paper is available in [3].
Acknowledgements This work was supported in part by a Swarnajayanti
Fellowship from the Dept. of Science and Technology, Govt. of India.
References
[1] L. Gravano et al. Approximate String Joins in a Database (almost) for Free. Proc. of 27th VLDB Conf., 2001.
[2] A. Kumaran and J. Haritsa. On the Costs of Multilingualism
in Database Systems. Proc. of 29th VLDB Conf., 2003.
[3] A. Kumaran and J. Haritsa. Supporting Multiscript Matching in Database Systems. Proc. of 9th EDBT Conf., 2004.
[4] J. Zobel et al. Phonetic String Matching: Lessons from Information Retrieval. Proc. of 19th ACM SIGIR Conf., 1996.
Download