Determining Hidden Behavior Patterns in a Youth
Olympics in México Using Data Mining
Guadalupe Gutiérrez1, Lourdes Margain1, Alberto Ochoa2, and Alejandro de Luna1
1 Universidad Politécnica de Aguascalientes (Mexico)
2 Universidad Autónoma de Ciudad Juárez (Mexico)
{guadalupe.gutierrez, lourdes.margain, alejandro.deluna}@upa.edu.mx,
alberto.ochoa@uacj.mx
Abstract. This paper presents a preliminary analysis that applies data mining
to the database of the Centro Nacional de Información y de Cultura Física y
Deporte de México to obtain hidden behavior patterns. The goals are to
determine which sport in the state of Aguascalientes should be supported the
most to compete in the Olympics, and to identify the gender that has shown the
better performance in the present year.
Keywords: data mining, hidden behavior patterns, Aguascalientes sport.
1 Introduction
Nowadays sports companies (e.g. Nike, Reebok) carry out a wide range of studies
to determine the best ways to promote their sporting goods and increase sales.
Likewise, sports institutes (e.g. CONADE) seek to identify which athletes to
support or which sport to promote for the Olympics. Databases are among the
most widely used sources for obtaining the attributes, preferences and even
skills of an individual through data mining techniques.
According to Agarwal [1], each record in a user database holds hundreds of
attributes that together make up a complete profile of a person. These
attribute chains are called high-dimensional data, and this is where data
mining is used to explore and exploit the information. Data mining makes it
possible to extract implicit information that could have been ignored and yet
could be useful for creating consistent behavior patterns or relations.
One of the areas where data mining can be applied is sport, where every four
years a budget of about 19,000,000 USD is assigned to assist athletes with
financial difficulties who have been selected by their national Olympic
committee, covering their preparation and qualification for the Olympic
Games 1.
This paper presents an analysis that applies data mining to obtain hidden
behavior patterns and predict which sport should be supported the most in the
city of Aguascalientes to compete in the Olympics. It also seeks to identify
gender performance in the sports that obtained the most gold medals in 2012.
1 http://www.ittf.com/ittf_development/PDF/London%202012%20-%20Directrices.pdf
2 Data Mining
Data mining explores large amounts of interrelated data to find hidden
information or behavior patterns that predict new tendencies [2,3]. To create
analytical models of information, data mining can use statistical models,
linear regression, mathematical algorithms, pattern recognition and machine
learning methods [4].
Data mining can be viewed as a set of analytical models designed to explore
large amounts of information, find relationships between variables and apply
them to new data sets [5]. The data mining process is shown in Figure 1, and
its steps are described below.
Fig. 1. Data mining process

Step 1: Domain analysis. Before performing a data mining analysis, it is
necessary to understand the application domain, the database and the user's
objectives.

Step 2: Data selection. Choose the data to use and identify the variables or
attributes of the database on which the discovery process will be performed.

Step 3: Data processing. Process the data with appropriate strategies to
control or minimize the noise and incomplete values that may be present.

Step 4: Transformation. Reduce and project the data to lower the number of
variables used.

Step 5: Identifying the method of discovery. Identify the discovery method to
use (e.g. classification, clustering, regression) and select the algorithm
that will perform the data mining process.

Step 6: Interpretation. Interpret the results, repeating the process if
necessary, to determine whether significant or improved knowledge has been
found.

Step 7: Knowledge. Add the discovered knowledge to the system to obtain an
improvement.
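The steps above can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions, not the procedure run in Weka: the
records, their field names (state, sport, gender, medal) and the function names
are hypothetical, since the layout of the database is not described here, and
the discovery method is reduced to counting gold medals per sport.

```python
from collections import Counter

# Hypothetical records: the real database layout is not given in the text.
RAW = [
    {"state": "Aguascalientes", "sport": "cycling", "gender": "F", "medal": "gold"},
    {"state": "Aguascalientes", "sport": "cycling", "gender": "M", "medal": "gold"},
    {"state": "Aguascalientes", "sport": "swimming", "gender": "F", "medal": "gold"},
    {"state": "Aguascalientes", "sport": "swimming", "gender": None, "medal": "silver"},
    {"state": "Jalisco", "sport": "swimming", "gender": "M", "medal": "gold"},
    {"state": "Aguascalientes", "sport": None, "gender": "F", "medal": "gold"},
]


def select(records, state):
    """Step 2: keep only the records for the state under study."""
    return [r for r in records if r["state"] == state]


def clean(records, required):
    """Step 3: drop records with missing values in the required attributes."""
    return [r for r in records if all(r[k] is not None for k in required)]


def transform(records, keep):
    """Step 4: project each record onto the attributes that will be mined."""
    return [tuple(r[k] for k in keep) for r in records]


def discover(rows):
    """Step 5: a deliberately simple method, counting gold medals per sport."""
    return Counter(sport for sport, medal in rows if medal == "gold")


def run_pipeline(records, state):
    selected = select(records, state)
    cleaned = clean(selected, required=("sport", "medal"))
    rows = transform(cleaned, keep=("sport", "medal"))
    return discover(rows)


gold_by_sport = run_pipeline(RAW, "Aguascalientes")
# Step 6: interpret the result, here by ranking sports by gold medals.
ranking = gold_by_sport.most_common()
print(ranking)
```

Keeping each step as a separate function makes it easy to repeat the process
with a different selection or cleaning strategy, as Step 6 suggests.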
Data mining is related to knowledge discovery in databases (KDD) and can be
based on artificial intelligence or statistical techniques, either predictive
or descriptive (see Figure 2), which are algorithms applied to a data set in
order to obtain a result.
Fig. 2. Data mining techniques
The analysis in this paper applies data mining through the Weka tool to the
database of the Centro Nacional de Información y de Cultura Física y Deporte
de México, which contains the results of the national Olympics 2.
3 Data Mining Tools
Nowadays a variety of tools exist to perform data mining, such as AC2,
AnswerTree, CART, Clementine, Elvira and Weka.
Weka [6], an acronym for Waikato Environment for Knowledge Analysis, is free
and open-source software released under the GNU General Public License (GPL).
It is an environment for applying and evaluating the techniques most relevant
to data analysis, mainly those from machine learning, on any data set. Weka is
the tool selected in this paper to run classification tests focused on
rule-based algorithms and to find behavior patterns hidden in the databases of
the Centro Nacional de Información y de Cultura Física y Deporte de México.
2 http://on2012.deporte.org.mx/ResultadosOlimpiada.aspx
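To make the idea of rule-based classification concrete, the following sketch
implements a one-rule learner in the spirit of OneR, one of the rule-based
schemes distributed with Weka. The text does not state which rule algorithm was
used, so the choice of OneR, the toy records and all names below are
assumptions made only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training records, not taken from the actual database.
ROWS = [
    {"sport": "cycling", "gender": "F", "medal": "gold"},
    {"sport": "cycling", "gender": "M", "medal": "gold"},
    {"sport": "cycling", "gender": "F", "medal": "gold"},
    {"sport": "swimming", "gender": "F", "medal": "gold"},
    {"sport": "swimming", "gender": "M", "medal": "silver"},
    {"sport": "athletics", "gender": "M", "medal": "silver"},
    {"sport": "athletics", "gender": "F", "medal": "silver"},
]


def one_rule(rows, target):
    """Pick the single attribute whose value-to-class rules misclassify least.

    Returns (attribute, rules, errors), where rules maps each attribute value
    to the majority class observed for that value.
    """
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Count how often each class appears for each value of the attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per value: predict the majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


attribute, rules, errors = one_rule(ROWS, target="medal")
print(attribute, rules, errors)
```

On these toy records the learner compares the rules built from sport with
those built from gender and keeps the attribute that makes fewer mistakes,
which is the kind of readable pattern a rule-based classifier is meant to
expose.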
4 Data Analysis
One of the first steps in this process is to determine the questions to be
answered by analyzing the database with data mining in Weka. Recent
investigations seek to identify why individuals develop or are born with the
ability to play a sport. According to studies of physiological bases and social
studies [7], the reasons could be: genetic bases, eating habits, geographical
theory, morphological reasons, lung capacity, extremities, muscular metabolism,
socio-cultural theory and muscle tissue. This paper seeks to answer the
following questions.

Which sport in the state of Aguascalientes has won the most gold medals?

In which sport do the genders show a better performance?
According to a survey of researchers working in data mining, the variables that
can influence the performance of athletes from Aguascalientes were determined
(see Table 1).
Table 1. Variables that influence athletes' performance
Variable | Description
Geographical theory | The environment of the place has an influence, that is, whether the place has a sea or forest, or lies at low, medium or high latitude.
Society | Regardless of genetic endowment for sports, talent is useless unless it is supported by optimal training and practice.
Pygmalion effect | A person achieves what they set out to do because of confidence in their performance.
Temporality | Things last for a limited time; they are not fixed or permanent but transient.
Morphological reasons | Athletes with less subcutaneous fat in the arms and legs, a proportionally finer body and musculature, wider shoulders, thicker quadriceps and, in general, a more developed musculature.
5 Results
The first step of the analysis was to determine the sport that stands out in
the state of Aguascalientes. Figure 3 shows that cycling is the sport that has
obtained the most medals in the last five years, followed by swimming and
track. The second-place sport raised the following question: could swimming
generate more victories in states closer to the sea? Figures 4 and 5 show that
states closer to the sea are not necessarily better at swimming.
Fig. 3. Sport’s achievements in Aguascalientes
Figure 4 shows that states with a coastline are not necessarily the best at
swimming: the State of Mexico, followed by the state of Jalisco, leads in gold
medals in the sport. The state of Aguascalientes even ranks two places higher
than the state of Quintana Roo, which has a variety of beaches.
Fig. 4. Gold medals in swimming in Mexico
Although Weka shows swimming as the sport that has won the most medals, cycling
has obtained the most gold medals, with a total of 35 against the 32 gold
medals registered in swimming.
Fig. 5. Sport’s gold medals in Aguascalientes
To analyze which gender performed best in those sports in 2012, Table 2 shows
the number of gold medals won in each sport of the Youth Olympics and the
number of athletes who participated. It also shows that the female gender
performed best, winning more gold medals.
Table 2. Influence variables in the athlete’s performance
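SPORT | GOLD'12 | C-ATHLETES | WOMEN | MEN | W-TEAM | M-TEAM
swimming | 4 | 6 | 2 | 1 | 0 | 0
cycling | 5 | 11 | 3 | 0 | 1 | 0
athletics | 0 | 1 | | | |
squash | 3 | 1 | | | |
Weightlifting | 3 | 1 | | | |
Taekwondo | 1 | 4 | | | |
Open Water | 0 | 1 | | | |
boxing | 1 | 2 | 2 | 1 | |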
Next, linear regression (formula 1) is applied to approximate how many medals
will be won in the following year; the result is shown as a scatter diagram in
Figure 6. The line obtained by linear regression represents the trend in a
series of data collected over a period, and such lines can indicate whether a
particular data set will increase or decrease over a given time. In this case
it indicates that, if the same number of athletes attends the Youth Olympics
next year, cycling, athletics, taekwondo and open water will have more wins,
whereas wins in squash and weightlifting are expected to decrease, and boxing
and swimming are forecast to remain the same. It should be noted that this
analysis was applied only to data from the year 2012, so it is proposed to
continue it later in greater depth using case-based reasoning to determine when
an athlete could win the gold medal.
(1)
Fig. 6. sport’s gold medals in Aguascalientes
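The trend line described in this section can be approximated with an ordinary
least-squares fit. The sketch below is an assumption-laden illustration rather
than the computation behind Figure 6: formula (1) is not reproduced in the
text, so it assumes a simple line gold = intercept + slope * athletes, and it
reads the first two numeric values of each row of Table 2 as the 2012 gold
medals and the number of competing athletes.

```python
# (athletes, gold medals in 2012) per sport, under the assumed reading of
# Table 2: first value = GOLD'12, second value = C-ATHLETES.
DATA = {
    "swimming": (6, 4),
    "cycling": (11, 5),
    "athletics": (1, 0),
    "squash": (1, 3),
    "weightlifting": (1, 3),
    "taekwondo": (4, 1),
    "open water": (1, 0),
    "boxing": (2, 1),
}


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(list(DATA.values()))


def gap(sport):
    """Fitted value minus observed gold medals for one sport.

    Positive: the line lies above the observed count.
    Negative: the line lies below the observed count.
    """
    athletes, gold = DATA[sport]
    return intercept + slope * athletes - gold


for sport in DATA:
    print(f"{sport:14s} fitted minus observed = {gap(sport):+.2f}")
```

Under this reading, the fitted line lies above the observed counts for
athletics, taekwondo and open water, and below them for squash and
weightlifting. With only eight points from a single year, the fit is a rough
indication of trend at best.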
6 Conclusions
Answering the first question, one would usually expect athletes from states
with a coastline to excel in water sports such as swimming, but the records of
the CNICFDE show that this is not necessarily true. This work therefore finds
that the geographical theory variable is not a determining factor for affirming
that states with beaches are better at swimming. Cycling has emerged as the
sport in which athletes from the state of Aguascalientes have won the most gold
medals, so the society variable is considered influential in this paper,
because the sport is promoted in this state through programs such as Aguas con
la Bici 3. The temporality variable can be ruled out because the sport has
continued to generate wins over the last five years. The Pygmalion effect
variable is one of the most difficult to prove, because it requires knowing the
athlete's mentality and surrounding environment in order to determine whether
there is sufficient motivation to win in a sport. Finally, the morphological
reasons variable cannot be said to be a cause of medals in this work, because
the database does not contain information about these features for each
athlete.
3 http://aguasconlabici.wordpress.com/
In this way, the analysis applied data mining to find hidden behavior patterns,
which show that cycling should be given further support, since over the last
five years it has generated more gold medals than other sports in the state of
Aguascalientes. As future work, it is planned to develop a data warehouse fed
with records of the medals won in the last five years in the Youth Olympics, as
well as information about where athletes practice, the environment in which
they live and even their morphological characteristics. Applying data mining
combined with case-based reasoning would then allow a deeper and more detailed
study to predict victories at the Olympics in the region.
References
1. Agarwal, A.; Phillips, J.; Venkatasubramanian, S., Universal Multi-Dimensional Scaling.
The 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2010.
2. Bigus, J., Data Mining With Neural Networks. McGraw-Hill, 1996.
3. Berry, M.; Linoff, G., Data Mining Techniques. Wiley Publishing, 2004.
4. Ochoa, A.R., G.; Bañuelos, F.; Mendhizavili, K.; Iztebegovič, H.; Hal, S., Herramienta
inteligente para la toma de decisiones basada en Minería de Datos, in XI JORNADAS DE
INVESTIGACIÓN, Revista Investigación Científica. 2007.
5. Fayyad, U.; Piatetsky-Shapiro, G., Advances in Knowledge Discovery and Data Mining.
AAAI Press, 1996.
6. Witten, I.H.; Frank, E., Data Mining: Practical Machine Learning Tools and Techniques,
2nd Edition. Morgan Kaufmann, San Francisco, 2005.
7. Rodríguez, M.N., A., La superioridad de los atletas africanos en las pruebas de resistencia,
in EFDeportes.com, Revista Digital. Buenos Aires, Año 15, Nº 148. 2010.