
machine learning

1. Introduction
Sport has been played for physical, mental, and social wellbeing, as well as for entertainment,
for ages; the history of the Olympic Games dates back to 776 BC, with the modern Olympic
Games starting in Greece in 1896. Currently, the summer and winter Olympic Games are each
held every four years and include 35 different sports: individual, combat and team sports. While
we enjoy exciting and thrilling Olympic performances, the journey of an athlete has started
long before, with talent identification, and continues through to competition performances.
It requires long-term athlete development through a rigorous, systematic and scientific
continuous training process to improve the athlete's physical and mental fitness, techniques and
tactics, and sport-specific skills throughout their sporting journey.
An athlete may take part in one or more competitions in a year. Training is therefore planned in
phases so that the athlete can give the best performance during competition. This process of
planning training in phases is called 'periodization', which was developed by the physiologist Leo
Matveyev and the sports scientist Tudor Bompa. Sport-specific training plans are prepared according
to the individual athlete's needs. Evaluation of the athlete's performance after each training
phase helps coaches monitor the athlete's progress and modify the training whenever required, to
improve performance. Regular fitness testing of athletes is therefore carried out to evaluate
their fitness and performance, and to adjust the program according to the athlete's physical and
sport-specific skill demands. Feedback from testing is useful for athletes to identify and
understand their strengths, the improvement achieved after training, and their overall position
compared with their peers in the group at national/international/Olympic level.
Regular monitoring and testing of athletes is also required for predicting their fitness and
performance during training, practice, and competitions. Injury-related risk factors can be
studied through fitness evaluation and biomechanical and gait analysis, and the prediction of
injuries is therefore helpful in injury prevention during training and competitions. Predictive
analysis requires historical testing data to predict athletes' future fitness and performance
at various competitions, and it plays a vital role in the selection of athletes for upcoming
competitions.
Prediction of sports performance is of utmost value to the entire sports ecosystem, which
includes athletes, coaches, physical trainers, sports scientists, sports medicine specialists,
physiotherapists, fans, sports institutions, stakeholders, sponsors, media, sports manufacturers
and brands.
Regression models are commonly used for predicting outcomes; Sir Francis Galton first developed a
regression model in 1886 to predict the height of adult children from their parents' height. In view
of Galton's theory of heredity, in which nature plays an important role in athletic performance,
sports scientists accept the role of both nature and nurture in athlete development. Genetics cannot
be altered, hence scientists focused on nurturing athletes through training, nutrition, and the
environment. Regression models were used to understand the effect of training on an athlete's
fitness and fatigue, and performance was then predicted.
Research was carried out in the field of physiology to understand the relationship of
physiological and physical fitness variables with performance. Sports scientists also used
inferential statistical methods, namely the independent t-test and one-way analysis of variance
(ANOVA), to determine which physiological and physical factors differentiated elite from
potential athletes. Ranks based on total z-scores were also used to identify potential athletes.
Linear regression models were used first to identify significant predictors of performance, and
z-scores were then computed using the significant fitness variables. Ranks were assigned to the
athletes based on these z-scores, and ranks were also assigned by the coaches. Further multiple
regression models were developed using z-scores instead of the raw fitness data. Again, the
predictive accuracy of the models was poor because they assumed a linear relationship.
Studies were also carried out to predict the aerobic endurance of athletes. Linear regression
models were developed to predict the maximal oxygen consumption of athletes. These models were
also criticized by other researchers because their predictive accuracy was poor: they assume a
linear relationship between physiological factors and performance, whereas the relationship is
non-linear. Researchers therefore improved on these models by developing non-linear regression
models using curve-smoothing techniques. There is also a problem of multicollinearity in
physiological and physical fitness data. To tackle it, principal component analysis was used to
identify the significant fitness components, and these principal components were used in
multiple regression models to predict sports performance. All of these studies were
cross-sectional, so the long-term effect of training on fitness and performance remained of
interest.
Linear and non-linear time-series models were developed to predict fitness, and eventually
performance, taking training effects into account. It was observed that the non-linear
time-series models were the best-fitting models. Multilevel random effects models were also
developed to identify fitness predictors and predict performance. Repeated measures ANOVA was
also used for longitudinal studies.
Sports performance records are always longitudinal and time-dependent. Athletes compete at
different ages throughout their sporting careers. Conventional time-series methods are therefore
not suitable, since the data are curvilinear in nature. To overcome this problem, functions of
the data are used for the analysis instead of the raw data; this is called functional data
analysis. Functional clustering algorithms were developed to classify time-series curve data.
Few research studies have modeled performance over time using functional data analysis in
sports analytics.
Sports technology such as GPS and video recording of training and competitions produces
huge volumes of data on the moment-by-moment movements of athletes, teams and opponents,
covering position, volume, speed and various fitness variables. Markov models were developed to
predict performance in racket sports by treating an athlete's movement as a stochastic process.
Real-time and post hoc analysis of such data is a major task, and machine learning models are
extremely useful for it. Bayesian models were used to plan a sports strategy and predict the
match outcome using real-time data during a live match. Recently, machine learning models were
used to identify elite and potential athletes, to identify predictors, and to predict sports
performance using hierarchical models, clustering, kernel functions and support vector machines.
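One simple way to read the Markov approach mentioned above is to estimate transition
probabilities between successive states from an observed sequence. The sketch below is
illustrative only: the shot labels, the sequence, and the function name estimate_transitions
are hypothetical and are not taken from any cited study. Python is used purely for illustration.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequence):
    """Estimate first-order Markov transition probabilities by counting."""
    counts = defaultdict(Counter)
    # Count every observed move from one state to the next.
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    # Convert the counts in each row into probabilities.
    return {
        state: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for state, row in counts.items()
    }


# Hypothetical shot sequence (illustrative only, not real match data).
shots = ["serve", "forehand", "backhand", "forehand",
         "forehand", "backhand", "serve", "forehand"]
P = estimate_transitions(shots)
print(P["forehand"])
```

Each row of P sums to one, so P["forehand"]["backhand"] can be read as the estimated chance
that a forehand is followed by a backhand in this toy sequence.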
There are some challenges with conventional statistical methods. Parametric methods require
several assumptions to hold, such as normally distributed variables, equal population variances,
and quantitative data. Moreover, a moderately large sample is required for predictive statistical
modeling, and it is difficult to obtain physiological and fitness testing data on a large number
of athletes. These methods use the data only once and may not be suitable for predicting the
outcome for unseen data, so the accuracy of prediction using conventional statistical models is
low. There is a need to deal with the small sample size problem in sports analytics. There is
also a need to address the problem of non-linearity in fitness data.
Considering the limitations of conventional statistical methods, there is a need to apply machine
learning methods in sports analytics. Machine learning methods use the data in parts for training
and testing the algorithm by drawing several random samples from the data; another part of the
data can be used for validation of the algorithms. Machine learning predictive models therefore
have advantages over conventional statistical methods, which use the data only once, and tend to
be more accurate when compared with the statistical models. Machine learning models are
data-driven, whereas conventional statistical models are method-driven. There is a need to
develop data-driven rather than method-driven models to predict sports performance more
accurately. It will also be interesting to compare conventional statistical and machine learning
predictive models for predictive accuracy and time efficiency.
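The idea of repeatedly splitting the data into training and testing parts can be sketched as
follows. This is a minimal illustration under stated assumptions, not the proposal's actual
procedure: the data are synthetic, the model is a simple least-squares line, and the names
fit_line and holdout_rmse, the 25% test fraction and the 20 repeats are all hypothetical choices.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def holdout_rmse(data, test_fraction=0.25, repeats=20, seed=0):
    """Average test RMSE over repeated random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        # Fit on the training part only, then score on the held-out part.
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        mse = sum((y - (a + b * x)) ** 2 for x, y in test) / len(test)
        errors.append(mse ** 0.5)
    return sum(errors) / len(errors)


# Synthetic (fitness score, performance score) pairs with alternating noise.
data = [(x, 2.0 * x + 1.0 + (0.5 if x % 2 else -0.5)) for x in range(12)]
rmse = holdout_rmse(data)
print(round(rmse, 3))
```

Because the error is always measured on observations the model did not see during fitting, the
averaged value is an estimate of how the model would perform on new athletes rather than on the
sample used to build it.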
Review of literature
1. Taha et al. (2018)
The purpose of the study was to develop machine learning algorithms for talent identification in
archery. The data were normalized using z-scores. The hierarchical agglomerative cluster analysis
(HACA) was applied to classify archers as high potential archers (HPA) or low potential archers
(LPA) using performance scores. Support Vector Machine (SVM) classifiers were developed
using six different kernel functions, namely linear; polynomial (quadratic, cubic); and
Radial Basis Function (RBF: fine, medium, low), to identify the optimal classifier based on
fitness and performance features.
2. Pantazopoulos and Maragoudakis (2018)
The study was conducted on 21 college students to predict running performance in the 800 m and
5000 m events. Along with linear regression, several other machine learning algorithms were
used as base learning models, namely Support Vector Machines (SVM), random forests, and
deep neural networks, using the scikit-learn and Keras libraries of Python; the Hyperopt module
of Python was used for process automation. Gradient Boosting Machines (GBM) were applied
to predict running time. GBM predicted the running times more accurately than the other machine
learning algorithms.
3. Leroy et al. (2018)
In this paper, the authors developed functional data analysis methods for clustering, aimed at
detecting potential young athletes from swimmers' real-time performance data. To overcome
inconsistent longitudinal data, the authors used functions instead of raw data for modeling
performance. Functional Principal Component Analysis (FPCA) and functional clustering
algorithms were used for the analysis. A curve clustering procedure was used to fit the model for
the identification of potential young swimmers using real data of performance progression curves.
4. Faber et al. (2018)
A univariate General Linear Model (GLM) was used to study differences in perceptuomotor skill
test scores (dependent variable) between primary school students and youth table tennis players.
Gender was treated as a fixed factor and the age of the participants as a covariate
in the GLM. Various datasets were prepared using a "leave one out" approach, and discriminant
analysis was carried out and validated on those datasets to study which battery items contribute
to the high/low performance of three groups.
5. Watson et al. (2017)
This case-control study examined pre-season fitness predictors of in-season
injuries and illness. The Kruskal-Wallis test followed by a post hoc Wilcoxon test was applied to
study within-group and between-group differences in injuries and illness with aerobic capacity
(VO2max and Tmax). Effect sizes were calculated for the injured and uninjured groups for all
aerobic and other variables. Univariate Poisson regression models were used to test the
significant predictors of in-season illnesses and injuries.
6. Ntai et al. (2017)
A multivariate analysis of variance (MANOVA) was applied to compare all anthropometric and
motor skill variables between groups defined by gender; three age groups; national/international
competition level; and fencing event (epee/foil/saber). The Bonferroni post hoc test was applied
to the variables that were significant in the MANOVA. Partial eta-squared (η2) values were used
to quantify the main and interaction effect sizes.
7. Afonso et al. (2017)
This review of 42 randomised controlled trials found issues related to the study
design and data analysis of previous research studies, namely: a) the duration of all
studies was short, i.e. < 9 months, and hence the long-term effect of periodization could not be
estimated; b) most of the studies used the mean and standard deviation to report the effect of
training; c) comparisons were made between different groups of athletes; d) only 9 studies
quantified the impact of training using effect sizes; and e) not a single study included in the
review reported zero, low, and high training effects on athletes. The review also found that none
of the studies predicted the direction, quantity, or timing of the periodization effect on the
athlete's fitness.
8. Balagué et al. (2016)
The means of all the physiological tests were compared between the four groups using Friedman
ANOVA. Principal Component Analysis (PCA) with varimax rotation was carried out on 6 selected
cardiorespiratory variables in each group of students, with repeated measures over the training
period. Tucker's congruence coefficient was applied to the PCs to compare them before training,
after training and after detraining. The median coefficient of congruence of the first principal
component (PC1) was computed and compared across the four groups and three periods using
non-parametric methods. Cohen’s d effect size was calculated for each group and period to study
the differences due to the training.
9. Percy (2015)
The author developed deep learning models and decision analysis for various stochastic
processes, using subjective and objective prior distributions for the model parameters, and used
sequential Bayesian posterior distributions to update the model parameters for outcome
prediction in real time during a live match. The university boat race data were used as a case
study to explain how the sequential Bayesian posterior distributions are updated during the race,
using Bernoulli processes and discrete-time Markov models for race timing outcome
prediction. The author also expressed the need for more research on time-varying prediction
modeling in sports.
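Sequential Bayesian updating for a Bernoulli process can be illustrated with the textbook
conjugate Beta prior. The sketch below is not Percy's model; it is a minimal example under
assumed values, where the Beta(2, 2) prior, the observation sequence, and the function name
update_beta are all hypothetical.

```python
def update_beta(alpha, beta, outcome):
    """One conjugate update of a Beta(alpha, beta) prior.

    outcome is 1 for a success and 0 for a failure.
    """
    return alpha + outcome, beta + (1 - outcome)


# Hypothetical prior belief: Beta(2, 2), i.e. a prior mean of 0.5.
alpha, beta = 2.0, 2.0

# Hypothetical in-play observations (1 = success, 0 = failure).
observations = [1, 1, 0, 1, 1]

posterior_means = []
for y in observations:
    # The posterior after each observation becomes the prior for the next.
    alpha, beta = update_beta(alpha, beta, y)
    posterior_means.append(alpha / (alpha + beta))

print(posterior_means)
```

The list of posterior means shows how the estimated success probability is revised after every
new observation, which is the mechanism that allows predictions to be updated during an event.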
10. Demirkan et al. (2015)
One-way analysis of variance was used to compare the differences between six groups of
wrestlers: lightweight elite; lightweight amateur; middleweight elite; middleweight amateur;
heavyweight elite; and heavyweight amateur. Tukey’s test was used for post-hoc comparisons. A
multivariable logistic regression analysis was used to study the physical and physiological
predictors of wrestling success. The prediction accuracy of the logistic regression model was
84.8%.
11. Bidaurrazaga-Letona et al. (2015)
A multi-level model was developed with repeated measures of physical fitness tests within
individual athletes at level 1 and individual athletes at level 2. All independent variables were
log-transformed because of the curvilinear relationship. A baseline model without somatic
maturity status and the interaction term was developed first; to study the effect of maturity
and the interaction, a second model was developed by adding these variables. The baseline model
and the model with the two added maturity variables were tested and validated using variance
explained and the Akaike Information Criterion (AIC).
12. Turner (2014)
An individual-level athleticism composite index was developed using fitness battery test results
for the selection of fencers. Z-scores were calculated for each physical and motor skill test, and
a "Total Score of Athleticism (TSA)" index was calculated by adding the z-scores of all the tests
applied to an individual athlete. Ranks were then assigned based on each athlete’s TSA score;
athletes with a higher TSA score were given a higher rank. Finally, the athletes were classified
as poor, average or best based on their ranks.
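The z-score composite described above can be sketched in a few lines. The example is
illustrative only: the athlete identifiers, test names and values are hypothetical, every test is
assumed to be scored so that higher is better, and the tertile cut-offs for the poor, average and
best classes are an assumption, since the summary does not state how the classes were formed.

```python
from statistics import mean, stdev

# Hypothetical fitness battery results (higher assumed better on every test).
results = {
    "A01": {"jump_cm": 40, "grip_kg": 45, "lunge_cm": 150},
    "A02": {"jump_cm": 50, "grip_kg": 50, "lunge_cm": 160},
    "A03": {"jump_cm": 60, "grip_kg": 55, "lunge_cm": 170},
}
tests = ["jump_cm", "grip_kg", "lunge_cm"]

# Total Score of Athleticism: sum of per-test z-scores for each athlete.
tsa = {athlete: 0.0 for athlete in results}
for test in tests:
    values = [results[athlete][test] for athlete in results]
    m, s = mean(values), stdev(values)
    for athlete in results:
        tsa[athlete] += (results[athlete][test] - m) / s

# Rank athletes from highest to lowest TSA.
ranking = sorted(tsa, key=tsa.get, reverse=True)

# Assumed tertile rule: top third "best", middle third "average", rest "poor".
n = len(ranking)
labels = {}
for rank, athlete in enumerate(ranking, start=1):
    if rank <= n / 3:
        labels[athlete] = "best"
    elif rank <= 2 * n / 3:
        labels[athlete] = "average"
    else:
        labels[athlete] = "poor"

print(ranking, labels)
```

In practice, tests where a lower value is better (for example a sprint time) would need their
z-scores sign-reversed before summing, otherwise faster athletes would be penalised.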
Objectives
The objectives of the proposed research are to use conventional statistical and machine learning
methods:
- To develop models for identification and selection of sports performance predictors
- To develop models for fitness and performance evaluations
- To develop models for talent prediction and selection
- To compare conventional statistical and machine learning predictive models
1. Methodology
Secondary data at the individual athlete level and the event level will be used for the proposed
research. Periodic fitness and sports-specific skills data will also be collected for each athlete.
Periodic testing of an athlete’s fitness and skills is generally done every quarter, in line with
the training phases of the sports discipline. Each athlete is tested one to four times a year, as
required for training and monitoring.
Individual athlete’s level data
Athlete’s data includes demographics such as age, education and native place of the athlete;
sports-specific data such as sports discipline, event, weight category (if applicable), training age
and sports age.
Fitness and motor skill testing include Human Performance (HP) laboratory and field testing data
of each athlete. HP laboratory testing data includes aerobic capacity i.e. the volume of oxygen
consumption of athletes and anaerobic capacity i.e. the power of athletes. Field testing data
includes height, weight, speed, agility, flexibility, muscular strength and endurance, motor skills
and sports specific skill of athletes.
Performance data include the athlete's participation in the various competitions in a year, the
type of competition (state, national or international level), and the type and number of
medals won in the competitions by each athlete.
Event level data
Training data are available at the sport-specific event level and not at the individual athlete
level. Training plan data on training cycles, the duration of those cycles, training phases, and
the training intensity and volume of each phase will be used for the data analysis.
Independent variables
Independent variables such as demographics, sports-specific details, training, fitness, and skills
of athletes will be compiled using various datasets. Fitness index will be compiled using fitness
and sports-specific skill variables. Fitness index will also be treated as an independent variable.
Dependent/ Outcome variables
Two outcome variables describing the athlete's actual performance will be compiled from the
competition and medals data, as follows:
- Outcome 1: Athlete’s status (winner/loser/not participated)
- Outcome 2: Aggregate performance index of each athlete
Data will be transformed and cleaned, and missing values will be replaced using suitable
statistical methods. All the analysis datasets will be prepared and merged using the athlete's
unique identification number.
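The merge-and-impute step can be illustrated as follows. This is a sketch under stated
assumptions rather than the planned pipeline: the identifiers, field names and values are
hypothetical, and mean imputation is used only as one example of a "suitable statistical method".
Python is used for illustration although the proposal plans its analysis in R.

```python
from statistics import mean

# Hypothetical datasets keyed by the athlete's unique identification number.
demographics = {"A01": {"age": 19}, "A02": {"age": 21}, "A03": {"age": 20}}
fitness = {
    "A01": {"vo2max": 52.0},
    "A02": {"vo2max": None},  # missing test result
    "A03": {"vo2max": 58.0},
}

# Merge the records of athletes that appear in both datasets.
merged = {}
for athlete_id in demographics.keys() & fitness.keys():
    merged[athlete_id] = {**demographics[athlete_id], **fitness[athlete_id]}

# Replace missing values with the mean of the observed values (assumed method).
observed = [row["vo2max"] for row in merged.values() if row["vo2max"] is not None]
fill_value = mean(observed)
for row in merged.values():
    if row["vo2max"] is None:
        row["vo2max"] = fill_value

print(merged["A02"])
```

Keying every dataset on the same identifier keeps the merge unambiguous; the choice of
imputation method should depend on how much data is missing and why.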
An exploratory analysis will be carried out first to understand the data. Normality of the
quantitative variables will be tested using the Kolmogorov-Smirnov test. Levene’s test will be
used for testing equality of variances. Parametric tests will be used for the analysis of normally
distributed quantitative variables. Non-parametric tests will be used for non-normal and
qualitative variables. Statistical significance will be set as p ≤ 0.05 and the analysis will be done
in the R software environment.
The conventional statistical and machine learning methods to be used for the data analysis
include, but are not limited to, the following:
a) Conventional statistical methods
- Paired/ unpaired t-test, one-way and repeated measures analysis of variance, Chi-square test
- Principal component analysis (PCA)
- Discriminant analysis
- Linear/ multiple correlation and regression
- Univariate/ multiple logistic regression, polynomial regression
- General linear models
- Random effect multilevel models
- Time-series
- Bayesian inference
- Markov modelling
b) Machine learning methods
- Linear/ multiple regression
- Univariate/ multiple logistic regression, polynomial regression
- PCA and PCA regression
- Classification and regression trees
- Decision trees
- Support vector machines
- Random forest
- K-means clustering
- Neural network
- Time series
Once developed, the predictive models built with statistical and machine learning methods will be
compared in terms of their assumptions, minimum sample size required, estimated errors, accuracy,
sensitivity, specificity, and time efficiency.
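Accuracy, sensitivity and specificity can be computed from predicted and actual outcomes as
sketched below. The example is illustrative: the labels are hypothetical, the function name
classification_metrics is not from the proposal, and the outcome is reduced to two classes
(winner versus not winner) for simplicity.

```python
def classification_metrics(actual, predicted, positive="winner"):
    """Return accuracy, sensitivity and specificity for a binary outcome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual winners that the model identified.
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # Share of actual non-winners that the model identified.
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical outcomes for eight athletes (illustrative only).
actual = ["winner", "winner", "winner", "loser",
          "loser", "loser", "loser", "winner"]
predicted = ["winner", "winner", "loser", "loser",
             "loser", "winner", "loser", "winner"]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Computing the same three measures on the same held-out athletes for each model gives a
like-for-like basis for comparing the statistical and machine learning approaches.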
Scope of the work
The proposed study will use machine learning methods, which are not commonly used in sports
analyses, to develop predictive models. It will help coaches and athletes to train according to
periodization and to improve sports performance. The predictive modeling will be
useful to coaches, sports scientists and decision makers in selecting and developing the best
athletes for the sport.
2. Work plan
Year 1
- Identify a research problem
- Review of literature for identification of existing knowledge, strengths and gaps in the
previous research
- Develop a research proposal
- Attend Ph.D. coursework, write assignments, and complete the coursework
- Prepare synopsis and presentation
Year 2
- Present synopsis at doctoral research committee
- Data collection, compilation, management, and preparing analysis datasets
- Data analysis and interpretation of the results
- Review of literature for the latest development in the research areas
- Prepare and submit the summary of research
- Prepare manuscripts and present papers at national/ international conferences
Year 3
- Complete the thesis writing and submit it for plagiarism check
- Prepare and appear for Pre Ph.D. Viva
- Modify the thesis and submit the final thesis
- Prepare and defend final Viva