machine learning

1. Introduction

Sport has been played for physical, mental, and social wellbeing, as well as for entertainment, since ancient times. The history of the Olympic Games dates back to 776 BC, and the modern Olympic Games started in Greece in 1896. Currently, the summer and winter Olympic Games are held every four years and include 35 different sports: individual, combat and team sports. While we enjoy exciting and thrilling Olympic performances, the athlete's journey started long before, with talent identification, and continues through to competition performance. It requires long-term athlete development through a rigorous, systematic and scientific continuous training process to improve the athlete's physical and mental fitness, techniques and tactics, and sports-specific skills throughout their sporting career. An athlete can take part in one or more competitions in a year, so training is planned in phases such that the athlete can give the best performance during competition. This process of planning training in phases is called 'periodization', which was developed by the physiologist Leo Matveyev and the sports scientist Tudor Bompa. Sports-specific training plans are prepared according to the individual athlete's needs. Evaluation of the athlete's performance after each training phase helps coaches monitor the athlete's progress and modify the training if needed to improve performance. Hence regular fitness testing of athletes is carried out to evaluate their fitness and performance and to adjust the program according to the athlete's physical and sports-specific skill demands. Feedback from testing is useful for athletes to identify and understand their strengths, the improvement achieved after training, and their overall position compared with peers in the group at national/international/Olympic level.
Regular monitoring and testing of athletes is also required for predicting their fitness and performance during training, practice, and competitions. Injury-related risk factors can be studied through fitness evaluation and biomechanical and gait analysis, and prediction of injuries is thus helpful in injury prevention during training and competitions. Predictive analysis requires historical testing data to predict athletes' future fitness and performance at various competitions, and it plays a vital role in the selection of athletes for upcoming competitions. Prediction of sports performance is of utmost use to the entire sports ecosystem, which includes athletes, coaches, physical trainers, sports scientists, sports medicine specialists, physiotherapists, fans, sports institutions, stakeholders, sponsors, media, sports manufacturers and brands. Regression models are commonly used for predicting outcomes; Sir Francis Galton first developed a regression model in 1886 to predict the height of adult children from their parents' height. Based on Galton's theory of heredity, in which nature plays an important part in athletic performance, sports scientists accept the role of both nature and nurture in athlete development. Genetics cannot be modified, hence scientists have focused on nurturing athletes through training, nutrition, and the environment. Regression models were used to understand the effect of training on an athlete's fitness and fatigue, and performance was then predicted. Research was also carried out in the field of physiology to understand the relationship of physiological and physical fitness variables with performance.
Sports scientists also used inferential statistical methods, namely the independent t-test and one-way analysis of variance (ANOVA), to determine which physiological and physical factors differentiated elite and potential athletes. Ranks based on total z-scores were also used to identify potential athletes. Linear regression models were used first to identify significant predictors of performance, and z-scores were then computed using the significant fitness variables. Ranks were assigned to the athletes based on these z-scores, and ranks were also given to the athletes by the coaches. Further, multiple regression models were developed using z-scores instead of the raw fitness data. Again, the predictive accuracy of these models was poor, as they assumed a linear relationship. Studies were also carried out to predict the aerobic endurance of athletes: linear regression models were developed to predict athletes' maximal oxygen consumption. These models were also criticized by other researchers because their predictive accuracy was poor; they assume a linear relationship between physiological factors and performance, whereas the relationship is non-linear. Researchers therefore improved these models by developing non-linear regression models using curve-smoothing methods. There is also a problem of multicollinearity in physiological and physical fitness data. To tackle it, principal component analysis was used to identify the significant fitness components, and these principal components were used in multiple regression models to predict sports performance.
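The principal component regression idea described above can be sketched as follows. This is only a minimal illustration in plain Python under stated assumptions: the data are synthetic, the variable names (`vo2max`, `strength`, `agility`, `performance`) are hypothetical, only the first principal component is retained, and power iteration stands in for a full eigendecomposition. It is not the procedure used in the cited studies.

```python
import math
import random


def standardize(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def first_component(z_columns, iterations=200):
    """Loadings of the first principal component, found by power iteration
    on the correlation matrix of already standardized columns."""
    p, n = len(z_columns), len(z_columns[0])
    corr = [[sum(a * b for a, b in zip(z_columns[i], z_columns[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    vec = [1.0] * p
    for _ in range(iterations):
        prod = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in prod))
        vec = [v / norm for v in prod]
    return vec


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Synthetic data (assumption): three fitness tests driven by one latent
# "fitness" factor, so the predictors are strongly collinear.
rng = random.Random(42)
latent = [rng.gauss(0, 1) for _ in range(40)]
vo2max = [50 + 5 * f + rng.gauss(0, 1) for f in latent]
strength = [100 + 10 * f + rng.gauss(0, 2) for f in latent]
agility = [20 + 2 * f + rng.gauss(0, 0.5) for f in latent]
performance = [10 + 2 * f + rng.gauss(0, 0.5) for f in latent]

# Replace the collinear predictors by their first principal component,
# then regress performance on that single component score.
z_columns = [standardize(col) for col in (vo2max, strength, agility)]
loadings = first_component(z_columns)
pc1_scores = [sum(w * col[i] for w, col in zip(loadings, z_columns))
              for i in range(len(latent))]
intercept, slope = fit_line(pc1_scores, performance)
print("PC1 loadings:", [round(w, 3) for w in loadings])
print("performance = %.2f + %.2f * PC1" % (intercept, slope))
```

Because the component scores are uncorrelated with one another by construction, regressing on them avoids the unstable coefficients that collinear raw predictors produce; with more components retained, each would enter the regression as a separate predictor.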
All of these studies were cross-sectional, and hence the long-term effect of training on fitness and performance was of interest. Linear and non-linear time-series models were developed to predict fitness and eventually performance, taking training effects into account. It was observed that the non-linear time-series models were the best-fitting models. Multilevel random-effects models were also developed to identify the fitness predictors and predict performance. Repeated-measures ANOVA was also used in longitudinal studies. Sports performance records data are always longitudinal and time-dependent, as athletes compete at different ages throughout their sporting careers. Conventional time-series methods are not suitable because the data are curvilinear in nature. To overcome this problem, functions of the data are used for the analysis instead of the raw data; this is called functional data analysis. Functional clustering algorithms were developed to classify time-series clustering data. Few research studies have modeled performance time using functional data analysis in sports analytics. Sports technologies such as GPS and video recording of training and competitions produce huge volumes of data on the step-by-step movements over time of athletes, sports teams and opponents, covering position, volume, speed and various fitness variables. Markov models were developed to predict performance in racket sports, taking into account the stochastic processes of a player's movements. Real-time and post hoc analysis of such data is an important task, and machine learning models are extremely useful for it. Bayesian models were used to plan game strategy and predict match outcomes using real-time data during a live match.
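The real-time Bayesian updating mentioned above can be illustrated with a conjugate Beta-Bernoulli model, in which the belief about a player's probability of winning a point is revised after every point. This is a hedged sketch only: the uniform prior, the `points` sequence and the function names are assumptions made for illustration, and the cited studies may use richer models.

```python
def update_beta(alpha, beta, won_point):
    """One sequential Bayesian update of a Beta(alpha, beta) belief
    after observing a single Bernoulli outcome (point won or lost)."""
    if won_point:
        return alpha + 1, beta
    return alpha, beta + 1


def posterior_mean(alpha, beta):
    """Posterior mean of the point-winning probability."""
    return alpha / (alpha + beta)


# Hypothetical point-by-point record for one player during a live match
# (1 = point won, 0 = point lost).
points = [1, 1, 0, 1, 0, 1, 1, 1]

alpha, beta = 1.0, 1.0  # uniform Beta(1, 1) prior (assumption)
history = []
for won in points:
    alpha, beta = update_beta(alpha, beta, bool(won))
    history.append(posterior_mean(alpha, beta))

print("Posterior after the match: Beta(%g, %g)" % (alpha, beta))
print("Estimated point-winning probability: %.3f" % history[-1])
```

The posterior from one point serves as the prior for the next, which is what makes the estimate usable while the match is still in progress; an informative prior built from historical data could replace the uniform one.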
Recently, machine learning models have been used to identify elite and potential athletes, to identify predictors, and to predict sports performance using hierarchical models, clustering kernel functions and support vector machines. There are several challenges in conventional statistical methods. Parametric methods require certain assumptions to hold, such as normally distributed variables, equal population variances, and quantitative data. Moreover, a moderately large sample is required for predictive statistical modeling, and it is difficult to obtain physiological and fitness testing data on a large number of athletes. These methods use the data only once and may not be good at predicting outcomes for unseen data, so the accuracy of prediction using conventional statistical models is low. There is a need to address the small sample size problem in sports analytics, as well as the problem of non-linearity in fitness data. Considering the limitations of conventional statistical methods, there is a need to apply machine learning methods in sports analytics. Machine learning methods use the data in parts: the algorithm is trained and tested on several random samples drawn from the data, and another part of the data can be used for validation. Machine learning predictive models thus have an advantage over conventional statistical methods, which use the data once, and are hence more accurate than the statistical models. Machine learning models are data-driven models, whereas conventional statistical models are method-driven models. There is a need to develop data-driven rather than method-driven models to predict sports performance more accurately. It will also be interesting to compare conventional statistical and machine learning predictive models for predictive accuracy and time efficiency.

Review of literature

1. Taha et al. (2018) The purpose of the study was to develop machine learning algorithms for talent identification in archery. The data were normalized using z-scores. Hierarchical agglomerative cluster analysis (HACA) was applied to classify archers as high potential archers (HPA) or low potential archers (LPA) using performance scores. Support Vector Machine (SVM) algorithms were developed using six different classification kernel functions, namely linear, polynomial (quadratic, cubic) and Radial Basis Function (RBF; fine, medium, low), to identify the optimal classifier based on fitness and performance features.
2. Pantazopoulos and Maragoudakis (2018) The study was conducted on 21 college students to predict running performance in the 800 m and 5000 m events. Along with linear regression, several other machine learning algorithms were used as base learning models, namely Support Vector Machines (SVM), random forests and deep neural networks, using the “scikit-learn” and “Keras” libraries of Python; the “Hyperopt” module of Python was used for process automation. Gradient Boosting Machines (GBM) were applied to predict running time. GBM predicted the running times more accurately than the other machine learning algorithms.
3. Leroy et al. (2018) In this paper, the authors developed functional data analysis methods for clustering, to detect potential young athletes using swimmers' real-time performance data. To overcome inconsistent longitudinal data, the authors used functions instead of raw data to model performance. Functional Principal Component Analysis (FPCA) and functional clustering algorithms were used for the analysis. A curve clustering procedure was used to fit the model for identifying potential young swimmers using real data on performance progression curves.
4. Faber et al. (2018) A univariate General Linear Model (GLM) was used to study differences in perceptuo-motor skill test scores (dependent variable) between primary school students and youth table tennis players.
Gender was considered as a fixed factor and the age of the participants as a covariate in the GLM. Various datasets were prepared using the "leave-one-out" approach, and discriminant analysis was carried out and validated on those datasets to study which battery items contribute to the high/low performance of the three groups.
5. Watson et al. (2017) The case-control study was conducted to identify pre-season fitness predictors of in-season injuries and illness. The Kruskal-Wallis test followed by the post hoc Wilcoxon test was applied to study within- and between-group differences in injuries and illness with aerobic capacity (VO2max) and Tmax. Effect sizes were calculated for the injured and uninjured groups for all aerobic and other variables. Univariate Poisson regression models were used to test the significant predictors of in-season illnesses and injuries.
6. Ntai et al. (2017) A multivariate analysis of variance (MANOVA) was applied to compare all anthropometric and motor skill variables between groups defined by gender, three age groups, national/international competition level, and fencing event (epee/foil/saber). The Bonferroni post hoc test was applied to the variables that were significant in the MANOVA. Partial eta-squared (η2) values were used to calculate the main and interaction effect sizes.
7. Afonso et al. (2017) This recent review of 42 randomised controlled clinical trials found issues related to the study design and data analysis of previous research, namely: a) the duration of all studies was short, i.e. < 9 months, and hence the long-term effect of periodization could not be estimated; b) most of the studies used the mean and standard deviation to report the effect of training; c) comparisons were made between different groups of athletes; d) only 9 studies mapped the impact of training using effect sizes; and e) not a single study included in the review reported zero, low and high training effects on athletes. This review also identified that none of the studies predicted the direction, quantity or timing of the periodization effect on the athlete's fitness.
8. Balagué et al. (2016) The means of all the physiological tests were compared between the four groups using Friedman ANOVA. Principal Component Analysis (PCA) with varimax rotation was performed on 6 selected cardiorespiratory variables in each group of students across the repeated-measures training periods. Tucker's congruence coefficient was applied to the PCs to compare them before training, after training and after detraining. The median coefficient of congruence of the first principal component (PC1) was computed and compared across the four groups and 3 periods using non-parametric tests, and Cohen’s d effect size was calculated for each group and period to study the differences due to training.
9. Percy (2015) The author applied deep learning models and decision analysis to various stochastic processes using subjective and objective prior distributions for the model parameters, and used sequential Bayesian posterior distributions to update the model parameters for outcome prediction in real time during a live match. University boat race data were used as a case study to explain how the sequential Bayesian posterior distributions were updated during the race, using Bernoulli processes and discrete-time Markov models for race timing outcome prediction. The author also expressed the need for more research on time-varying prediction modeling in sports.
10. Demirkan et al. (2015) One-way analysis of variance was used to compare the differences between the groups: lightweight elite, lightweight amateur, middleweight elite, middleweight amateur, heavyweight elite and heavyweight amateur wrestlers. Tukey’s test was used for post hoc comparisons. A multivariable logistic regression analysis was used to study the physical and physiological predictors of wrestling success. The prediction accuracy of the logistic regression model was 84.8%.
11. Bidaurrazaga-Letona et al. (2015) A multilevel model was developed with repeated measures of physical fitness tests within individual athletes at level 1 and individual athletes at level 2. All independent variables were log-transformed due to the curvilinear relationship. A baseline model without somatic maturity status and the interaction term was developed first; to study the effect of maturity and the interaction, a second model was developed by adding these variables. The two models were tested and validated using the variance explained and the Akaike Information Criterion (AIC).
12. Turner (2014) An individual-level composite index of athleticism was developed from fitness battery test results for the selection of fencers. Z-scores were calculated for each physical and motor skill test, and the "Total Score of Athleticism (TSA)" index was calculated by adding the z-scores of all the tests applied to an individual athlete. Ranks were then assigned based on each athlete’s TSA score, with higher TSA scores receiving higher ranks. Finally, the athletes were classified as poor, average or best based on their ranks.

Objectives

The objectives of the proposed research are to use conventional statistical and machine learning methods:
- To develop models for identification and selection of sports performance predictors
- To develop models for fitness and performance evaluations
- To develop models for talent prediction and selection
- To compare conventional statistical and machine learning predictive models

1. Methodology

Secondary data at the individual athlete level and the event level will be used for the proposed research. Periodic fitness and sports-specific skills data will also be collected for each athlete.
Periodic testing of athletes' fitness and skills is generally done every quarter, in line with the training phases of the sports discipline. Each athlete is tested one to four times a year, as required for training and monitoring.

Individual athlete’s level data

Athlete’s data include demographics such as age, education and native place, and sports-specific data such as sports discipline, event, weight category (if applicable), training age and sports age. Fitness and motor skill testing includes Human Performance (HP) laboratory and field testing data for each athlete. HP laboratory testing data include aerobic capacity, i.e. the volume of oxygen consumption, and anaerobic capacity, i.e. the power of the athletes. Field testing data include height, weight, speed, agility, flexibility, muscular strength and endurance, motor skills and sports-specific skills. Performance data include the athlete's participation in various competitions in a year, the type of competition (state, national or international level), and the type and number of medals won in the competitions by each athlete.

Event level data

Training data are available at the sports-specific event level and not at the individual athlete level. Training plan data on training cycles, the duration of those cycles, training phases, and the training intensity and volume of each phase will be used for the data analysis.

Independent variables

Independent variables such as demographics, sports-specific details, training, fitness and skills of athletes will be compiled from various datasets. A fitness index will be compiled using fitness and sports-specific skill variables, and it will also be treated as an independent variable.
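The proposal does not fix a formula for the fitness index. One possible construction, sketched below purely as an assumption, follows the summed z-score approach of Turner (2014) described in the review of literature. The athlete identifiers, test names, values and the `direction` mapping are hypothetical; tests where a lower raw value is better (such as sprint time) are sign-flipped before summing.

```python
from statistics import mean, stdev

# Hypothetical test results; identifiers and values are invented for illustration.
athletes = {
    "A01": {"vo2max": 62.0, "sprint_30m_s": 4.1, "sit_and_reach_cm": 38.0},
    "A02": {"vo2max": 55.0, "sprint_30m_s": 4.4, "sit_and_reach_cm": 30.0},
    "A03": {"vo2max": 58.0, "sprint_30m_s": 4.3, "sit_and_reach_cm": 34.0},
    "A04": {"vo2max": 50.0, "sprint_30m_s": 4.7, "sit_and_reach_cm": 25.0},
}

# +1 if a higher raw value is better, -1 if a lower raw value is better.
direction = {"vo2max": 1, "sprint_30m_s": -1, "sit_and_reach_cm": 1}


def fitness_index(records, direction):
    """Sum of direction-adjusted z-scores across tests for each athlete."""
    ids = list(records)
    index = {athlete: 0.0 for athlete in ids}
    for test, sign in direction.items():
        values = [records[athlete][test] for athlete in ids]
        centre, spread = mean(values), stdev(values)
        for athlete in ids:
            index[athlete] += sign * (records[athlete][test] - centre) / spread
    return index


index = fitness_index(athletes, direction)
ranking = sorted(index, key=index.get, reverse=True)  # best athlete first
for position, athlete in enumerate(ranking, start=1):
    print(position, athlete, round(index[athlete], 2))
```

An equally weighted sum is the simplest choice; weights derived from regression coefficients or principal component loadings would be alternatives if some tests are known to matter more for a given sport.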

Dependent/ Outcome variables

Two outcome variables of the athlete's actual performance will be compiled using competition and medal data, as follows:
- Outcome 1: Athlete’s status - winner/loser/not participated
- Outcome 2: Aggregate performance index of each athlete

Data will be transformed and cleaned, and missing values will be replaced using suitable statistical methods. All the analysis datasets will be prepared and merged using the athlete's unique identification number. An exploratory analysis will be carried out first to understand the data. Normality of the quantitative variables will be tested using the Kolmogorov-Smirnov test. Levene’s test will be used for testing equality of variances. Parametric tests will be used for the analysis of normally distributed quantitative variables. Non-parametric tests will be used for non-normal and qualitative variables. Statistical significance will be set at p ≤ 0.05, and the analysis will be done in the R software environment.

The conventional statistical and machine learning methods to be used for the data analysis include, but are not limited to, the following:

a) Conventional statistical methods
- Paired/ unpaired t-test, one-way and repeated measures analysis of variance, Chi-square test
- Principal component analysis (PCA)
- Discriminant analysis
- Linear/ multiple correlation and regression
- Univariate/ multiple logistic regression, polynomial regression
- General linear models
- Random effect multilevel models
- Time-series
- Bayesian inference
- Markov modelling

b) Machine learning methods
- Linear/ multiple regression
- Univariate/ multiple logistic regression, polynomial regression
- PCA and PCA regression
- Classification and regression trees
- Decision trees
- Support vector machines
- Random forest
- K-means clustering
- Neural network
- Time series

Predictive models, once developed using statistical and machine learning methods, will be compared in terms of assumptions, minimal sufficiency of sample size required, estimated errors, accuracy, sensitivity, specificity, and time efficiency.

Scope of the work

The proposed study will use machine learning methods that are not commonly used in sports analyses to develop predictive models. It will be helpful to coaches and athletes in training as per periodization and in improving sports performance. The predictive modeling will be useful to coaches, sports scientists and decision makers in selecting and developing the best athletes for the sport.

2. Work plan

Year 1
- Identify a research problem
- Review the literature to identify existing knowledge, strengths and gaps in previous research
- Develop a research proposal
- Attend Ph.D. coursework, write assignments, and complete the coursework
- Prepare synopsis and presentation

Year 2
- Present synopsis at the doctoral research committee
- Data collection, compilation, management, and preparation of analysis datasets
- Data analysis and interpretation of the results
- Review the literature for the latest developments in the research areas
- Prepare and submit the summary of research
- Prepare manuscripts and present papers at national/ international conferences

Year 3
- Complete the thesis writing and submit it for plagiarism check
- Prepare and appear for Pre Ph.D. Viva
- Modify the thesis and submit the final thesis
- Prepare and defend final Viva
