Translating Natural Language Queries in Spanish to SQL
involving Group By
Jose A. Martínez F.1, Alberto Ochoa-Zezzatti2, and Andrés Bautista1
1 Instituto Tecnológico de Ciudad Madero (Mexico)
2 Juárez City University
jose.mtz@itcm.edu.mx, Alberto.ochoa@uacj.mx, andres.bautista@live.com.mx
Abstract. This paper analyzes the translation of Spanish natural language queries
into SQL when the queries involve the GROUP BY grouping clause in Natural Language
Interfaces to Databases (NLIDBs), the important role this clause plays, and the different ways in which it appears in natural language.
Keywords: Aggregate Functions, Group By clause, Natural Language Interfaces, Natural Language Processing.
1 Introduction
Currently, most of the information stored in databases (DBs) is later consulted for
decision making. To facilitate this consultation, several tools have been developed that make users' work easier (e.g., query assistants, graphical interfaces with menus, etc.). Many of these tools
can generate queries that meet users' requirements; however, their limitations prevent
them from performing every type of query. For this reason, Natural Language
Interfaces to Databases (NLIDBs) were developed, through which information can be
obtained from a DB with a natural language query [1].
Some of these queries contain statistical expressions that amount to processing
the data stored in one or more tables by means of aggregate functions and the GROUP
BY clause. With SQL, data can be grouped and aggregated so that users can work with
it at a higher level of granularity than that of the data as stored in the database.
For an NLIDB to provide the information requested in a query, it must understand
the oral or written expression of people who communicate in Spanish. Query processing starts with a lexical analysis and ends when the SQL query is
generated.
NLIDBs have been developed since the 1960s and, unfortunately, they still do not
answer 100% of user queries correctly. This is mainly because most
NLIDBs are unable to process queries involving aggregate functions
or grouping [2].
This article describes and analyzes queries involving aggregate functions and
grouping, showing examples and reviewing the process needed for their
correct translation into SQL.
2 Natural Language Interface
Natural language processing (NLP) is a set of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, in order
to achieve human-like language processing for a range of tasks and applications [3].
Natural language interfaces (NLIs) are mechanisms for communication between people
and a machine through natural language. Typically, this communication is bidirectional (i.e., of the question-answer type). The general architecture of an NLI is shown in
Figure 1.
Fig. 1. General architecture of an NLI
3 Natural Language Interface to Databases
Figure 2 shows the flow of an NLIDB, in which the result is usually presented in
one of two ways: as an SQL statement or as an answer in natural language. In this article the
results are returned as SQL statements.
Fig. 2. NLIDB Flow
Some major NLIDBs found in the literature are listed in Table 1, which also notes whether they support aggregate functions [1].
Table 1. Main NLIDBs developed
Interface
Aggregate Functions
TAMIC (1996)
IDICULA (1999)
PRECISE (2003)
InBase (2003)
NLPQC (2005)
Translator CENIDET (2005)
WYSIWYM (2006)
Translator OWDA
Dravidian Language (2007)
C-PHRASE (2008)
Translator Rojas (2009)
STK (2010)
Translator Esquivel (2010)
Current work ITCM (2012)
X
X
X
X
X
X
X

X
X
X


4 Aggregate Functions and GROUP BY clause
Aggregate functions are functions that take a collection of values as input and produce
a single output value. SQL provides five primitive aggregation functions:
1. COUNT: returns the total number of rows selected.
2. SUM: adds up the values of a column.
3. MIN: returns the minimum value of a column.
4. MAX: returns the maximum value of a column.
5. AVG: calculates the average value of a column.
To extend the use of aggregate functions, the GROUP BY clause is needed; it is
used to group rows by specific columns [4].
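The five functions and the effect of GROUP BY can be tried out with a minimal sketch such as the one below. It uses Python's standard sqlite3 module; the productos table, its columns and its rows are invented purely for illustration and do not come from any database discussed in this article.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE productos (nombre TEXT, categoria TEXT, precio REAL)")
conn.executemany(
    "INSERT INTO productos VALUES (?, ?, ?)",
    [
        ("leche", "lacteos", 20.0),
        ("queso", "lacteos", 80.0),
        ("res", "carnes", 150.0),
        ("pollo", "carnes", 90.0),
    ],
)

# Without GROUP BY, each aggregate collapses the whole table into one value.
totals = conn.execute(
    "SELECT COUNT(*), SUM(precio), MIN(precio), MAX(precio), AVG(precio) "
    "FROM productos"
).fetchone()

# With GROUP BY, the aggregates are computed once per category.
per_category = conn.execute(
    "SELECT categoria, COUNT(*), AVG(precio) FROM productos "
    "GROUP BY categoria ORDER BY categoria"
).fetchall()

print(totals)
print(per_category)
```

The first query yields a single row for the whole table, while the second yields one row per value of categoria, which is the behavior an NLIDB has to request when the user asks for a figure "per" or "for each" something.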
5 Analysis of Translation of Aggregate Functions and GROUP
BY clause
As seen in Section 4, aggregate functions allow us to perform operations on the stored information in order to obtain better results from our database queries.
To better understand the use of aggregate functions, their syntax is shown next.
─ MAX and MIN syntax:
SELECT MAX/MIN ("name_of_column")
FROM "name_of_table"
Example in Spanish natural language:
“Dame el precio mayor de los productos” (Give me the highest price of the
products).
The SQL statement generated is:
SELECT MAX(precio)
FROM PRODUCTOS
─ SUM function and GROUP BY clause:
SELECT "name1_column", SUM ("name2_column")
FROM "name_table"
GROUP BY "name1_column"
Example in Spanish natural language:
“¿Cuántos trabajadores hay en cada departamento?” (How many employees
are there in each department?).
The corresponding SQL statement, which uses COUNT because the question asks how many rows fall into each group, is:
SELECT department, COUNT(employee)
FROM departments, employees
WHERE departments.id = employees.idDepartment
GROUP BY department
As the above examples show, the use of aggregate functions allows the user to
obtain more specific information.
If aggregate functions are so necessary and so widely used in the real world, what
prevents such useful retrieval options from being implemented in NLIDBs? Answering this question requires an extensive analysis of the
translation techniques of each NLIDB developed. However, since we are dealing with Spanish natural language, we can identify some patterns in the sentences or phrases used in the queries
posed to NLIDBs that indicate the use of aggregate functions and the GROUP BY
clause.
To understand the above, we focused on the analysis of some queries from the
corpus of the Linguistic Database Cultures of the World: A Statistical Reference,
adapted from the work of Philip M. Parker. This DB has only two tables (social_demography, geography), which hold information on the linguistic groups of the world, their geography, demographics, etc.
Examples of queries from the corpus:
1. Sociedades que viven en clima templado (Societies that live in a temperate climate).
2. Nivel de Deforestación (Deforestation level).
3. Mayor número de fronteras (Greatest number of borders).
4. Mayor ocurrencia de Terremotos (Greatest occurrence of earthquakes).
5. Clasificación de Sociedades por Huso Horario (Classification of societies by time zone).
Query 1 should be resolved properly by any NLIDB, generating an SQL
statement similar to the following:
SELECT society
FROM geography
WHERE climate = 'templado'
For query 4, the response of most current NLIDBs, if they give any response at all, would be
to omit the word 'Mayor' (greatest) and show the occurrence of earthquakes for all societies.
The SQL translation would be the following:
SELECT earthquakes
FROM geography
Some NLIDBs are adaptive and can add new sentence-recognition patterns, but this would imply adding a new pattern for each type and structure of sentence that can
be formed in the extensive Spanish natural language, increasing the resources
needed for processing.
What happens if query 3 is phrased in other ways? Examples:
─ Dame el número mayor de fronteras (Give me the greatest number of borders).
─ Muéstrame el mayor número de fronteras (Show me the greatest number of borders).
─ De las fronteras ¿cuál es el número mayor? (Of the borders, what is the greatest number?).
When the query is analyzed in detail, we note that the difficulty of understanding it through a translation process such as those used in NLIDBs increases, which is
why aggregate functions deserve to be analyzed from different points of view
before their implementation can be discussed.
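One possible way to look at the three rephrasings of query 3 is that, once accents, punctuation and function words are set aside, they all carry the same content words. The Python sketch below illustrates this idea only; it is not the translation technique of any NLIDB mentioned here, and the STOPWORDS set is a hypothetical list chosen by hand for these four sentences.

```python
import re
import unicodedata

# Hypothetical stop-word list, hand-picked for these example sentences only.
STOPWORDS = {"dame", "muestrame", "el", "la", "las", "los", "de", "cual", "es"}


def content_words(query):
    """Lowercase, strip accents and punctuation, and drop stop words."""
    decomposed = unicodedata.normalize("NFD", query.lower())
    # Remove combining marks so that "número" becomes "numero".
    plain = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    tokens = re.findall(r"[a-z]+", plain)
    return frozenset(t for t in tokens if t not in STOPWORDS)


variants = [
    "Mayor número de fronteras",
    "Dame el número mayor de fronteras",
    "Muéstrame el mayor número de fronteras",
    "De las fronteras ¿cuál es el número mayor?",
]

distinct = {content_words(q) for q in variants}
print(distinct)
```

All four sentences reduce to the same set of content words, which suggests that word order alone need not multiply the number of patterns; deciding what those words mean is still the hard part of the translation.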
Returning to query 4, we note that the user is requesting only
one fact: the highest occurrence of earthquakes on record. To answer this query
without returning excess or erroneous information, the MAX aggregate function
can be used to show exactly the data the user requests; see the equivalent SQL query:
SELECT MAX(earthquakes)
FROM geography
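The difference between the two translations of query 4 can be checked on a toy table. In this sketch the geography table is reduced to two columns, and the three rows are invented values, not data from the corpus.

```python
import sqlite3

# Toy stand-in for the geography table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE geography (society TEXT, earthquakes INTEGER)")
conn.executemany(
    "INSERT INTO geography VALUES (?, ?)",
    [("A", 3), ("B", 12), ("C", 7)],
)

# Translation that ignores "Mayor": one row per society.
naive = conn.execute("SELECT earthquakes FROM geography").fetchall()

# Translation that maps "Mayor" to MAX: a single row with the requested fact.
with_max = conn.execute("SELECT MAX(earthquakes) FROM geography").fetchall()

print(len(naive), with_max)
```

The first statement returns as many rows as there are societies and leaves the user to find the maximum, whereas the second returns only the value asked for.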
As we can see, grouping is very important, although the examples are admittedly
very simple because of the database that was used. In companies whose information belongs to a large network of department stores and is all stored in a single DB, translating the requested query
requires a complex analysis that includes the necessary relationships. We illustrate this with an
example query in Spanish natural language:
“Dame el número de trabajadores del departamento de carnes de las sucursales de
la ciudad de México que tengan menos de 2 años de antigüedad”.
(Give me the number of workers in the meat department of the branches in Mexico
City who have less than 2 years of seniority.)
To solve the above query, it is first necessary to determine the relationships involved. In this
case, the entities that must be joined to obtain the required information are empleados, departamentos, sucursales and ciudades (employees, departments, branches and cities). We then
have to consider whether aggregate functions or the GROUP BY clause are needed. In this
case both are necessary: the COUNT aggregate function to count the number of workers
and the GROUP BY clause to group by department.
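A translation of this query might look like the sketch below. The article names only the four entities, so every column name, key and row here is an assumption made for illustration, including the simplification of storing seniority as a number of years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the four table names come from the text.
conn.executescript("""
CREATE TABLE ciudades (id INTEGER PRIMARY KEY, nombre TEXT);
CREATE TABLE sucursales (id INTEGER PRIMARY KEY, id_ciudad INTEGER);
CREATE TABLE departamentos (id INTEGER PRIMARY KEY, nombre TEXT, id_sucursal INTEGER);
CREATE TABLE empleados (id INTEGER PRIMARY KEY, id_departamento INTEGER, antiguedad REAL);
INSERT INTO ciudades VALUES (1, 'México'), (2, 'Monterrey');
INSERT INTO sucursales VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO departamentos VALUES
    (1, 'carnes', 1), (2, 'carnes', 2), (3, 'carnes', 3), (4, 'lacteos', 1);
INSERT INTO empleados VALUES
    (1, 1, 1.0), (2, 1, 3.0), (3, 2, 0.5), (4, 3, 1.0), (5, 4, 1.5);
""")

# Join the four entities, filter, then count workers per department name.
sql = """
SELECT d.nombre, COUNT(e.id)
FROM empleados e
JOIN departamentos d ON e.id_departamento = d.id
JOIN sucursales s ON d.id_sucursal = s.id
JOIN ciudades c ON s.id_ciudad = c.id
WHERE d.nombre = 'carnes' AND c.nombre = 'México' AND e.antiguedad < 2
GROUP BY d.nombre
"""
result = conn.execute(sql).fetchall()
print(result)
```

Of the five sample employees, only two satisfy all three conditions (meat department, a branch in Mexico City, less than 2 years of seniority), so the statement returns a single group with a count of 2.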
This article marks the beginning of a master's project planned
at the Instituto Tecnológico de Ciudad Madero (ITCM), which aims to solve the
translation of Spanish natural language queries over relational databases in order to extend the
translation domain of an NLIDB.
Some of the keywords that we will analyze when translating queries to identify the use of aggregate functions are shown in Table 2, with one column for the word or phrase in
Spanish and another for the corresponding aggregate function or clause. Only the main
ones are shown; the number will grow in our implementation, which will
consider all possible synonyms that exist in Spanish and other words or phrases
that may arise.
6 Conclusions
As seen throughout this article, developing natural language interfaces that translate queries involving aggregate functions and the GROUP BY clause
requires careful analysis and a good solution strategy so that each query is translated
correctly and can be processed by the interface.
Resolving important natural language processing issues and applying the solutions in NLIs
broadens the range of information that can be obtained from any relational DB.
Table 2. Word analysis to use aggregate functions and GROUP BY clause
Palabra/Frase (Word/Phrase)    Función de Agregación/Cláusula (Aggregate Function/Clause)
Cuántos                        COUNT
Suma                           SUM
Promedio                       AVG
Media                          AVG
Máximo                         MAX
Mayor                          MAX
Mínimo                         MIN
Menor                          MIN
Todos los(as)                  COUNT
El Total                       SUM
Agrupado                       GROUP BY
Clasificado                    GROUP BY
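The correspondence in Table 2 can be read as a simple lookup from Spanish words or phrases to SQL functions and clauses. The Python sketch below is one way such a lookup might be written, not the implementation planned for the project; the function names are hypothetical, and expanding "Todos los(as)" into "todos los" and "todas las" is an assumption about what that entry covers.

```python
import re
import unicodedata

# Main entries of Table 2, lowercased and with accents removed.
KEYWORDS = {
    "cuantos": "COUNT",
    "suma": "SUM",
    "promedio": "AVG",
    "media": "AVG",
    "maximo": "MAX",
    "mayor": "MAX",
    "minimo": "MIN",
    "menor": "MIN",
    "todos los": "COUNT",
    "todas las": "COUNT",
    "el total": "SUM",
    "agrupado": "GROUP BY",
    "clasificado": "GROUP BY",
}


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    plain = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return " ".join(re.findall(r"[a-z]+", plain))


def detect_clauses(query):
    """Return the SQL functions/clauses whose keywords occur as whole words."""
    padded = f" {normalize(query)} "
    return {sql for phrase, sql in KEYWORDS.items() if f" {phrase} " in padded}


print(detect_clauses("Mayor ocurrencia de Terremotos"))
print(detect_clauses("Dame el promedio de precio agrupado por departamento"))
```

Because the match is on exact whole words, a query such as "Clasificación de Sociedades por Huso Horario" is not recognized by this sketch, which illustrates why the list has to be extended with synonyms and related word forms.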
References
1. Rojas J.C. Administrador de Diálogo para una Interfaz de Lenguaje Natural a Bases de Datos, 2009.
2. Androutsopoulos I., Ritchie G.D., Thanisch P. Natural Language Interfaces to Databases - An Introduction. Natural Language Engineering, 1995.
3. Liddy D. Natural Language Processing for Information Retrieval & Knowledge Discovery. School of Information Studies, 2001.
4. Carme Martín Escofet. El lenguaje SQL.