Translating Natural Language Queries in Spanish to SQL involving Group By

Jose A. Martínez F.1, Alberto Ochoa-Zezzatti2, and Andrés Bautista1
1 Instituto Tecnológico de Ciudad Madero (Mexico)
2 Juárez City University
jose.mtz@itcm.edu.mx, Alberto.ochoa@uacj.mx, andres.bautista@live.com.mx

Abstract. This paper analyzes the translation of natural language queries in Spanish into SQL when the GROUP BY grouping clause is involved, in the context of Natural Language Interfaces to Databases (NLIDBs). It discusses the important role of grouping and the different ways in which it is expressed in natural language.

Keywords: Aggregate Functions, Group By clause, Natural Language Interfaces, Natural Language Processing.

1 Introduction

Currently, most of the information stored in databases (DBs) is later queried to support decision making. To facilitate querying, several tools have been developed that ease the users' work (e.g., query assistants and menu-driven graphical interfaces). Many of these tools can generate queries that meet user requirements; however, because of their limitations, they cannot perform every type of query. For this reason, Natural Language Interfaces to Databases (NLIDBs) were developed, through which information can be obtained from a DB with a natural language query [1].

Some of these queries contain statistical expressions that amount to processing the data stored in one or more tables through aggregate functions and the GROUP BY clause. With SQL, data can be grouped and aggregated so that users can interact with it at a higher level of granularity than that of the data as stored in the database.

For an NLIDB to provide the information requested in a query, it must understand the spoken or written expressions of people who communicate in Spanish. Query processing starts with lexical analysis and ends when the SQL query is generated.
NLIDBs have been under development since the 1960s, and unfortunately they still do not produce correct answers for 100% of the queries posed by users. This is mainly because most NLIDBs cannot process queries involving aggregate functions or grouping [2].

This article describes and analyzes queries involving aggregate functions and grouping, showing examples and reviewing the process needed to translate them correctly into SQL.

2 Natural Language Interface

Natural language processing (NLP) is a set of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, in order to achieve human-like language processing for a range of tasks and applications [3].

Natural language interfaces (NLIs) are mechanisms of communication between a person and a machine through natural language. Typically, this communication is bidirectional (i.e., of the question-answer type). The general architecture of an NLI is shown in Figure 1.

Fig. 1. General architecture of an NLI

3 Natural Language Interface to Databases

Figure 2 shows the flow of an NLIDB, in which the result is usually presented in one of two ways: as an SQL statement or as an answer in natural language. In this article the results are returned as SQL statements.

Fig. 2. NLIDB flow

Some of the major NLIDBs found in the literature are listed in Table 1, which also notes their use of aggregate functions [1].

Table 1. Main NLIDBs developed
Interface / Aggregate Functions
TAMIC (1996) IDICULA (1999) PRECISE (2003) InBase (2003) NLPQC (2005) Translator CENIDET (2005) WYSIWYM (2006) Translator OWDA Dravidian Language (2007) C-PHRASE (2008) Translator Rojas (2009) STK (2010) Translator Esquivel (2010) Present work ITCM (2012)
X X X X X X X X X X

4 Aggregate Functions and GROUP BY clause

Aggregate functions are functions that take a collection of values as input and produce a single output value.
SQL provides five primitive aggregate functions:
1. COUNT: returns the total number of rows selected.
2. SUM: adds the values of a column.
3. MIN: returns the minimum value of a column.
4. MAX: returns the maximum value of a column.
5. AVG: calculates the average value of a column.

To extend the use of aggregate functions, the GROUP BY clause is needed; it is used to group rows by specific columns [4].

5 Analysis of Translation of Aggregate Functions and GROUP BY clause

As seen in Section 4, aggregate functions let us perform operations on the stored information in order to obtain better results from database queries. To better understand their use, their syntax is shown next.

MAX and MIN syntax:
SELECT MAX/MIN ("name_of_column") FROM "name_of_table"

Example in Spanish: "Dame el precio mayor de los productos" (Give me the highest price of the products). The SQL statement generated is:
SELECT MAX(precio) FROM PRODUCTOS

Aggregate function with the GROUP BY clause, shown here with SUM:
SELECT "name1_column", SUM("name2_column") FROM "name_table" GROUP BY "name1_column"

Example in Spanish: "¿Cuántos trabajadores hay en cada departamento?" (How many employees are there in each department?). This query follows the same pattern with COUNT in place of SUM; the corresponding SQL statement is:
SELECT department, COUNT(employee) FROM departments, employees WHERE departments.id = employees.idDepartment GROUP BY department

As the examples above show, aggregate functions allow the user to obtain more specific information. If aggregate functions are so necessary and so widely used in the real world, what prevents such useful retrieval options from being implemented in NLIDBs? Answering this question requires an extensive analysis of the translation techniques of each NLIDB developed.
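The two example statements of this section can be tried on a toy database. The following is a minimal sketch in Python using the standard sqlite3 module; the table layouts and the sample rows are assumptions made only for illustration, since the paper does not give the schemas behind its examples.

```python
import sqlite3

# In-memory database; table layouts and rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE productos (nombre TEXT, precio REAL);
INSERT INTO productos VALUES ('pan', 12.5), ('leche', 21.0), ('queso', 85.0);

CREATE TABLE departments (id INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE employees (employee TEXT, idDepartment INTEGER);
INSERT INTO departments VALUES (1, 'carnes'), (2, 'lacteos');
INSERT INTO employees VALUES ('Ana', 1), ('Luis', 1), ('Eva', 2);
""")

# "Dame el precio mayor de los productos"
max_price = conn.execute("SELECT MAX(precio) FROM productos").fetchone()[0]

# "¿Cuántos trabajadores hay en cada departamento?"
# ORDER BY is added only to make the row order deterministic.
per_department = conn.execute("""
    SELECT department, COUNT(employee)
    FROM departments, employees
    WHERE departments.id = employees.idDepartment
    GROUP BY department
    ORDER BY department
""").fetchall()

print(max_price)
print(per_department)
```

MAX collapses the whole table into one value, whereas COUNT combined with GROUP BY returns one row per department.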
As speakers of Spanish, we can observe patterns in the sentences or phrases used in queries to NLIDBs that indicate the use of aggregate functions and the GROUP BY clause. To illustrate this, we focused on the analysis of some queries from the corpus of the linguistic database Cultures of the World: A Statistical Reference, adapted from Philip M. Parker. This DB has only two tables (social_demography, geography), which hold the information on the linguistic groups of the world, their geography, demographics, etc.

Examples of queries from the corpus:
1. Sociedades que viven en clima templado (Societies that live in a temperate climate).
2. Nivel de Deforestación (Deforestation level).
3. Mayor número de fronteras (Greatest number of borders).
4. Mayor ocurrencia de Terremotos (Greatest occurrence of earthquakes).
5. Clasificación de Sociedades por Huso Horario (Classification of societies by time zone).

Query 1 should be resolved correctly by any NLIDB, generating an SQL statement similar to the following:
SELECT society FROM geography WHERE climate = 'templado'

For query 4, the response of most current NLIDBs, if they give any response at all, would be to ignore the word 'Mayor' (greatest) and show the occurrence of earthquakes for all societies. The SQL translation would be the following:
SELECT earthquakes FROM geography

Some NLIDBs are adaptive and can add new sentence-recognition patterns, but this would mean adding a new pattern for each type and structure of sentence that can be formed in the extensive Spanish language, increasing the resources needed for processing.

What happens if query 3 is entered in other ways? Examples:
─ Dame el número mayor de fronteras (Give me the greatest number of borders).
─ Muéstrame el mayor número de fronteras (Show me the greatest number of borders).
─ De las fronteras ¿cuál es el número mayor? (Of the borders, which is the greatest number?).
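All the phrasings of query 3 share the word "mayor", whatever its position in the sentence. The sketch below illustrates keyword spotting in Python, assuming a small hypothetical keyword dictionary modeled on the paper's Table 2. The names `KEYWORDS`, `normalize` and `detect_aggregate` are illustrative only and are not taken from any NLIDB described here.

```python
import unicodedata

# Hypothetical keyword dictionary: a small subset in the spirit of Table 2.
KEYWORDS = {
    "mayor": "MAX", "maximo": "MAX",
    "menor": "MIN", "minimo": "MIN",
    "promedio": "AVG", "media": "AVG",
    "cuantos": "COUNT", "suma": "SUM",
}

def normalize(text):
    """Lower-case and strip accents so that 'Máximo' matches 'maximo'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def detect_aggregate(query):
    """Return the aggregate function of the first keyword found, or None."""
    for token in normalize(query).split():
        word = token.strip("¿?¡!.,;:")
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None

variants = [
    "Mayor número de fronteras",
    "Dame el número mayor de fronteras",
    "Muéstrame el mayor número de fronteras",
    "De las fronteras ¿cuál es el número mayor?",
]
results = [detect_aggregate(v) for v in variants]
print(results)
```

Word order does not affect the result, so one dictionary entry covers all four phrasings. The sketch does not determine which column the function applies to, and it cannot tell the superlative "mayor" from the comparative "mayor que" (greater than); a real translator would need further analysis for both.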
When the variants of query 3 are analyzed in detail, we note that the difficulty of understanding them through a translator such as those used in NLIDBs increases. This is why aggregate functions deserve to be analyzed from different points of view before their implementation is discussed.

Returning to query 4, we note that the user is requesting a single fact: the greatest occurrence of earthquakes on record. To answer this query without returning excess or erroneous information, the MAX aggregate function can be used to show exactly the data the user requests, as in the equivalent SQL query:
SELECT MAX(earthquakes) FROM geography

As we can see, aggregation and grouping are very important. The examples are admittedly very simple because of the database used. In a company such as a large chain of department stores, where all information is stored in a single DB, translating a query requires a complex analysis of what is being requested and of the relationships that must be included. Consider an example query in Spanish: "Dame el número de trabajadores del departamento de carnes de las sucursales de la ciudad de México que tengan menos de 2 años de antigüedad" (Give me the number of workers in the meat department of the branches in Mexico City who have less than 2 years of seniority).

To solve this query, it is first necessary to determine the relationships involved. In this case, the entities that must be joined to obtain the information are empleados, departamentos, sucursales and ciudades (employees, departments, branches and cities). We must then consider whether aggregate functions or the GROUP BY clause are needed. In this case both are necessary: the COUNT aggregate function to count the number of workers, and the GROUP BY clause to group by department.
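One possible SQL translation of the department-store query is sketched below in Python with the standard sqlite3 module. The paper names only the four entities, so every column name, every foreign key and all sample rows here are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: only the four entity names come from the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ciudades (id INTEGER PRIMARY KEY, nombre TEXT);
CREATE TABLE sucursales (id INTEGER PRIMARY KEY, idCiudad INTEGER);
CREATE TABLE departamentos (id INTEGER PRIMARY KEY, nombre TEXT,
                            idSucursal INTEGER);
CREATE TABLE empleados (id INTEGER PRIMARY KEY, nombre TEXT,
                        idDepartamento INTEGER, antiguedad REAL);

INSERT INTO ciudades VALUES (1, 'México'), (2, 'Monterrey');
INSERT INTO sucursales VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO departamentos VALUES
    (1, 'carnes', 1), (2, 'carnes', 2), (3, 'carnes', 3), (4, 'lacteos', 1);
INSERT INTO empleados VALUES
    (1, 'Ana',  1, 1.0),   -- counted
    (2, 'Luis', 1, 3.5),   -- excluded: 2 or more years of seniority
    (3, 'Eva',  2, 0.5),   -- counted
    (4, 'Raul', 3, 1.0),   -- excluded: branch is not in México
    (5, 'Sol',  4, 1.5);   -- excluded: not the meat department
""")

# Join the four entities, filter, then count the workers per department.
rows = conn.execute("""
    SELECT d.nombre, COUNT(e.id)
    FROM empleados e
    JOIN departamentos d ON e.idDepartamento = d.id
    JOIN sucursales s ON d.idSucursal = s.id
    JOIN ciudades c ON s.idCiudad = c.id
    WHERE d.nombre = 'carnes'
      AND c.nombre = 'México'
      AND e.antiguedad < 2
    GROUP BY d.nombre
""").fetchall()

print(rows)
```

The translator has to infer three joins that the user never mentions, in addition to the two filters, the count and the grouping column.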
This article marks the beginning of a master's project planned at the Instituto Tecnológico de Ciudad Madero (ITCM), which aims to translate Spanish natural language queries over relational databases and thereby extend the translation domain of NLIDBs.

Some of the keywords that will be analyzed when translating queries, in order to identify the use of aggregate functions, are shown in Table 2. One column gives the word or phrase in Spanish and the other the corresponding aggregate function or clause. Only the main keywords are shown; the number will grow in our implementation, which will consider all possible synonyms in Spanish as well as other words or phrases that may arise.

Table 2. Word analysis to use aggregate functions and GROUP BY clause
Palabra/Frase (Word/Phrase)    Función de Agregación/Cláusula (Aggregate Function/Clause)
Cuantos                        COUNT
Suma                           SUM
Promedio                       AVG
Media                          AVG
Máximo                         MAX
Mayor                          MAX
Mínimo                         MIN
Menor                          MIN
Todos los(as)                  COUNT
El Total                       SUM
Agrupado                       GROUP BY
Clasificado                    GROUP BY

6 Conclusions

As we have seen throughout this article, developing natural language interfaces that translate queries involving aggregate functions and the GROUP BY clause requires careful analysis and a good solution strategy, so that the query is translated correctly and can be processed by the interface. Resolving these issues in natural language processing and applying the solutions in NLIs broadens the range of information that can be obtained from any relational DB.

References
1. Rojas J.C. Administrador de Diálogo para una Interfaz de Lenguaje Natural a Bases de Datos, 2009.
2. Androutsopoulos I., Ritchie G.D., Thanisch P. Natural Language Interfaces to Databases - An Introduction. Natural Language Engineering, 1995.
3. Liddy D. Natural Language Processing for Information Retrieval & Knowledge Discovery, School of Information Studies, 2001.
4. Carme Martín Escofet. El lenguaje SQL.