Determining Hidden Behavior Patterns in a Youth Olympics in México Using Data Mining

Guadalupe Gutiérrez¹, Lourdes Margain¹, Alberto Ochoa², and Alejandro de Luna¹
¹ Universidad Politécnica de Aguascalientes (Mexico)
² Universidad Autónoma de Ciudad Juárez (Mexico)
{guadalupe.gutierrez, lourdes.margain, alejandro.deluna}@upa.edu.mx, alberto.ochoa@uacj.mx

Abstract. This paper presents a preliminary analysis that applies data mining to the database of the Centro Nacional de Información y de Cultura Física y Deporte de México in order to uncover hidden behavior patterns, to determine which sport in the state of Aguascalientes should receive the most support to compete in the Olympics, and to identify which gender has performed better in the present year.

Keywords: data mining, hidden behavior patterns, Aguascalientes sport.

1 Introduction

Nowadays some sports companies (e.g. Nike, Reebok) carry out a variety of studies to determine the best ways to promote their sporting goods and increase sales. Likewise, sports institutes (e.g. CONADE) seek to identify which athletes to support or which sport to promote for the Olympics. Databases are among the most widely used sources for obtaining the attributes, preferences and even skills of an individual through data mining techniques. According to Agarwal [1], each record in a user database consists of a row of hundreds of attributes that together give a complete picture of a person; these attribute chains are called high-dimensional data, and it is here that data mining is used to explore and exploit the information. Data mining makes it possible to extract implicit information that might otherwise be ignored and yet could be useful for building consistent behavior patterns or relations.
One area where data mining can be applied is sport, where every four years a budget of about 19,000,000 USD is assigned to assist athletes with financial difficulties who have been selected by their National Olympic Committee, supporting their preparation and qualification for the Olympic Games¹. This paper presents an analysis that applies data mining to obtain hidden behavior patterns and to predict which sport should be supported the most in the state of Aguascalientes to compete in the Olympics. It also seeks to identify gender performance in the sports that obtained the most gold medals in 2012.

¹ http://www.ittf.com/ittf_development/PDF/London%202012%20-%20Directrices.pdf

2 Data Mining

Data mining explores large amounts of interrelated data to find hidden information or behavior patterns that predict new tendencies [2,3]. To create analytical models of information, data mining can use statistical models, linear regression, mathematical algorithms, pattern recognition and machine learning methods [4]. Data mining can be viewed as the use of analytical models to explore large amounts of information, find relationships between variables and apply them to new data sets [5]. The data mining process is shown in Figure 1; its steps are described below.

Fig. 1. Data mining process

Step 1: Domain analysis. Before performing an analysis with data mining, it is necessary to understand the application domain, the database and the user's objectives.
Step 2: Data selection. Choose the data to use and identify the variables or attributes of the database to analyze in the discovery process.
Step 3: Data processing. Process the data with appropriate strategies to control or minimize the noise or incomplete values that the data may contain.
Step 4: Transformation. Reduce and project the data to lower the number of variables used.
Step 5: Identifying the method of discovery. Identify the discovery method to use (e.g. classification, clustering, regression) and select the algorithm that will perform the data mining process.
Step 6: Interpretation. Interpret the results, repeating the process if necessary until significant or improved knowledge is obtained.
Step 7: Knowledge. Incorporate the discovered knowledge into the system to obtain an improvement.

Data mining is related to the discovery of knowledge in databases, known as Knowledge Discovery in Databases (KDD), and can draw on artificial intelligence techniques and statistical techniques, either predictive or descriptive (see Figure 2); these are algorithms applied to a set of data in order to obtain a result.

Fig. 2. Data mining techniques

To carry out the analysis with the Weka tool, the database of the Centro Nacional de Información y de Cultura Física y Deporte de México is used; it contains the results of the National Olympics².

² http://on2012.deporte.org.mx/ResultadosOlimpiada.aspx

3 Data Mining Tools

Nowadays a variety of tools exist for data mining, such as AC2, AnswerTree, CART, Clementine, Elvira and Weka. Weka [6], an acronym for Waikato Environment for Knowledge Analysis, is free, open-source software released under the GNU General Public License (GPL). It is an environment for applying and evaluating the most relevant data analysis techniques, mainly those from machine learning, on any data set. Weka is the tool selected in this paper to run classification tests focused on rule-based algorithms and to find hidden behavior patterns in the database of the Centro Nacional de Información y de Cultura Física y Deporte de México.

4 Data Analysis

One of the first steps in this process is to determine the questions we want to answer by analyzing the database with data mining in Weka.
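As a minimal sketch of the data selection and data processing steps described above (Steps 2 and 3), the following Python code prepares a handful of medal records and writes them in ARFF, the text format that Weka reads. The records, the attribute names and the relation name `youth_olympics` are hypothetical and are not taken from the database used in this paper.

```python
# Hypothetical medal records, invented for illustration only.
RAW_RECORDS = [
    {"state": "Aguascalientes", "sport": "cycling", "gender": "F", "medal": "gold"},
    {"state": "Aguascalientes", "sport": "swimming", "gender": "M", "medal": "gold"},
    {"state": "Aguascalientes", "sport": "cycling", "gender": "M", "medal": "silver"},
    {"state": "Aguascalientes", "sport": "boxing", "gender": None, "medal": "silver"},
]

# Step 2 (data selection): the attributes assumed to matter for the analysis.
SELECTED = ("sport", "gender", "medal")


def select_attributes(records, attributes):
    """Keep only the chosen attributes of every record."""
    return [{name: record.get(name) for name in attributes} for record in records]


def drop_incomplete(records):
    """Step 3 (data processing): discard records with missing values."""
    return [r for r in records if all(value is not None for value in r.values())]


def to_arff(relation, records, attributes):
    """Render records as ARFF text with nominal attributes.

    Values are assumed to contain no spaces or commas, so no quoting is done.
    """
    lines = [f"@relation {relation}", ""]
    for name in attributes:
        values = sorted({record[name] for record in records})
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    lines += [",".join(record[name] for name in attributes) for record in records]
    return "\n".join(lines)


prepared = drop_incomplete(select_attributes(RAW_RECORDS, SELECTED))
arff_text = to_arff("youth_olympics", prepared, SELECTED)
print(arff_text)
```

Saving the printed text to a file with the `.arff` extension would give an input that can be opened in Weka; the incomplete boxing record is dropped before export.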
Recent investigations seek to identify why individuals develop, or are born with, the ability to play a sport. According to physiological and social studies [7], the reasons may include: genetic bases, eating habits, geographical theory, morphological reasons, lung capacity, extremities, muscular metabolism, socio-cultural theory and muscle tissue. This paper seeks to answer the following questions: In which sport has the state of Aguascalientes won the most gold medals? In which sports does each gender show the better performance? A survey of researchers working in data mining was used to determine the variables that can influence the performance of athletes from Aguascalientes (see Table 1).

Table 1. Variables that influence an athlete's performance

Variable               Description
Geographical theory    The environment of the place has an influence, that is, whether it has sea or forest, or lies at low, medium or high latitude.
Society                This variable holds that genetic endowment for sports, however great, is useless unless supported by optimal training and practice.
Pygmalion effect       This variable refers to the phenomenon by which a person achieves what they set out to do because of the confidence placed in their performance.
Temporality            This variable refers to the fact that things last for a limited time; they are not fixed or permanent, but transient.
Morphological reasons  This variable refers to athletes who have less subcutaneous fat in the arms and legs, a proportionally leaner body and musculature, wider shoulders, thicker quadriceps and, in general, more developed musculature.

5 Results

For the analysis it was necessary to determine the sport that stands out in the state of Aguascalientes. Figure 3 shows graphically that cycling is the sport that has obtained the most medals in the last five years, followed by swimming and track. The sport with the second most victories raises the following question: could swimming generate more victories in states closer to the sea? Figures 4 and 5 show that states closer to the sea are not necessarily better at swimming.

Fig. 3. Sports achievements in Aguascalientes

Figure 4 shows that states with a coastline are not necessarily the best at swimming: the State of Mexico, followed by the state of Jalisco, leads in gold medals in this sport. The state of Aguascalientes even ranks two places above the state of Quintana Roo, which has a variety of beaches.

Fig. 4. Gold medals in swimming in Mexico

Even though the Weka tool shows swimming as the sport that has won the most medals, cycling is the one that has obtained the most gold medals, with a total of 35 compared with the 32 registered for swimming.

Fig. 5. Sports' gold medals in Aguascalientes

To analyze which gender performed best in those sports in 2012, Table 2 shows the number of gold medals won in each sport of the Youth Olympics and the number of athletes who participated; it also shows that women performed best, winning more gold medals.

Table 2. Influence variables in the athlete's performance

SPORT          GOLD'12  C-ATHLETES  WOMEN  MEN  W-TEAM  M-TEAM
swimming       4        6           2      1    0       0
cycling        5        11          3      0    1       0
athletics      0        1
squash         3        1
Weightlifting  3        1
Taekwondo      1        4
Open Water     0        1
boxing         1        2           2      1

Applying the linear regression formula (1), an approximation is obtained of how many medals will be won in the next year; this process is illustrated by the scatter diagram in Figure 6.
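The kind of linear regression forecast mentioned here can be sketched with an ordinary least-squares line. The code below is only an illustration under stated assumptions: it uses the standard least-squares expressions for slope and intercept, which may differ in form from formula (1), and the yearly gold medal counts are hypothetical values, not figures from the database.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Hypothetical gold medal counts for one sport over five consecutive years.
year_index = [1, 2, 3, 4, 5]
gold_medals = [3, 4, 4, 6, 5]

slope, intercept = fit_line(year_index, gold_medals)
forecast = predict(slope, intercept, 6)  # estimate for the following year

trend = "increase" if slope > 0 else "decrease" if slope < 0 else "no change"
print(f"slope={slope:.2f}, forecast for year 6={forecast:.1f}, trend: {trend}")
```

For these invented counts the slope is positive (about 0.6 medals per year), so the fitted line suggests an increase; a negative slope would suggest a decrease.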
The line obtained by linear regression represents the trend in a series of data collected over a period. Such lines indicate whether a particular data set will increase or decrease over a given time. In this case, the line indicates that if the same number of athletes attend the Youth Olympics next year, cycling, athletics, taekwondo and open water will have more wins, whereas squash and weightlifting are expected to have fewer, and boxing and swimming are forecast to remain the same. It should be noted that this analysis was applied only to data from 2012, so it is proposed to continue it later in greater depth, using case-based reasoning to determine when an athlete could win a gold medal.

[Equation (1): linear regression formula]

Fig. 6. Sports' gold medals in Aguascalientes

6 Conclusions

In answer to the first question, one would usually expect athletes from states with a coastline to excel in water sports such as swimming, but the CNICFDE records show that this is not necessarily true. This work therefore concludes that the geographical theory variable is not a determining factor: states with beaches are not necessarily better at swimming. Cycling has emerged as the sport in which athletes from the state of Aguascalientes have won the most gold medals, so the society variable is considered influential in this paper, because the sport is promoted in the state through programs such as Aguas con la Bici³. The temporality variable can be ruled out because the sport has continued to generate wins over the last five years. The Pygmalion effect variable is one of the most difficult to prove, because it requires knowing the athlete's mentality and surrounding environment in order to determine whether there is sufficient motivation to win in a sport.
Moreover, the morphological reasons variable cannot be said in this work to be a cause of medals, because the database does not contain information about these features for each athlete. In this way, data mining has been used to find hidden behavior patterns showing that cycling should be given further support, since over the last five years it has generated more gold medals than any other sport in the state of Aguascalientes.

Future work is planned to develop a data warehouse fed with records of medals won in the last five years in the Youth Olympics, as well as information about where athletes practice, the environment in which they live and even their morphological characteristics. By applying data mining combined with case-based reasoning, a deeper and more detailed study could then be made to predict victories at the Olympics in the region.

³ http://aguasconlabici.wordpress.com/

References

1. Agarwal, A., Phillips, J.M., Venkatasubramanian, S., Universal Multi-Dimensional Scaling. The 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
2. Bigus, J., Data Mining with Neural Networks. McGraw-Hill, 1996.
3. Berry, M., Linoff, G., Data Mining Techniques. Wiley Publishing, 2004.
4. Ochoa, A.R., G.; Bañuelos, F.; Mendhizavili, K.; Iztebegovič, H.; Hal, S., Herramienta inteligente para la toma de decisiones basada en Minería de Datos, in XI Jornadas de Investigación, Revista Investigación Científica, 2007.
5. Fayyad, U., Piatetsky-Shapiro, G., Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
6. Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
7. Rodríguez, M.N., A., La superioridad de los atletas africanos en las pruebas de resistencia, in EFDeportes.com, Revista Digital, Buenos Aires, Año 15, Nº 148, 2010.