A Multivariate Analysis on the 2004 Summer Olympic Games

Wei Xiong, M.Sc. Student, Department of Mathematics and Statistics, University of Guelph
May 12-13, 2005

OUTLINE
1. Introduction
   • 2004 Summer Olympic Games
   • Multivariate techniques: cluster analysis, multivariate analysis of variance, multivariate regression analysis
   • Literature review of analyses on Olympic Games
2. Data Analysis and Discussion
3. Conclusions

2004 Summer Olympic Games
• The largest event: 11,000 athletes from 202 countries; 929 medals won by 75 countries/regions.

Multivariate (>1 response variable) Techniques
• Cluster Analysis: observations (countries) are classified into clusters (groups) according to their similarity on several variables (number of gold, silver, bronze and total medals), by measuring the distance or dissimilarity between any two clusters.
• Multivariate Analysis of Variance (MANOVA): a generalization of ANOVA, used to compare more than two population mean vectors.
  Hypothesis: $H_0: \mu_1 = \dots = \mu_t$ versus $H_a: \mu_j \neq \mu_k$ for some $j \neq k$.
  $H_0$ is rejected if $H$ = SS(Treatment) is much larger than $E$ = SS(Error), that is, when Wilks' statistic is small:
  $\Lambda = |E| / |E + H|$
• Multivariate Regression model:
  $Y_{(n \times p)} = X_{(n \times q)} \, \beta_{(q \times p)} + E_{(n \times p)}$
  where n: observations, p: response variables, q: explanatory variables.
  The least squares estimator of $\beta$ is $\hat{\beta} = (X'X)^{-1} X'Y$.

Literature review
• Condon et al. [1] tried to predict a country's success at the Olympic Games using linear regression models and neural network models.
• Lins et al. [2] developed a Data Envelopment Analysis (DEA)-based model to rank each country by its ability to win medals in relation to its available resources.
• Churilov and Flitman [3] improved the DEA-based model by combining different sets of input parameters with the DEA model.
This study: uses multivariate techniques to analyze the 2004 Summer Olympic Games and explores the factors that influence the number of medals won.
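As an illustration of the least squares estimator $(X'X)^{-1}X'Y$ from the multivariate regression slide, here is a minimal Python sketch. It is not the SAS code used in the talk. The helper names (`transpose`, `matmul`, `inverse`, `least_squares`) and the toy data are made up for this example, and the design matrix is assumed to carry a leading column of ones for the intercept.

```python
# Sketch only: toy numbers, hypothetical helper names, pure standard library.

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def least_squares(x, y):
    """Return (X'X)^{-1} X'Y; one column of coefficients per response variable."""
    xt = transpose(x)
    return matmul(matmul(inverse(matmul(xt, x)), xt), y)

# Toy data (assumption): n = 4 observations, q = 2 (intercept + one predictor),
# p = 2 responses built as y1 = 1 + 2*x and y2 = 2 + 0*x.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [[1, 2], [3, 2], [5, 2], [7, 2]]

beta_hat = least_squares(X, Y)
print(beta_hat)  # close to [[1, 2], [2, 0]]: row 1 = intercepts, row 2 = slopes
```

Each column of `beta_hat` is the ordinary least squares fit for one response, which is why the multivariate coefficients equal those from fitting each medal count separately.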
Table 1: Rankings for Participating Countries

Country    Gold (y1)  Silver (y2)  Bronze (y3)  Total (y4)  Ranking (by Gold) [4]  Ranking (by Cluster Analysis)
USA        35         39           29           103         1                      1
China      32         17           14           63          2                      2
Russia     27         27           38           92          3                      1
Canada     3          6            3            12          21                     4
Syria      0          0            1            1           71                     5
Trinidad   0          0            1            1           71                     5

Note: clusters 1, 2, 3, 4 and 5 contain 2, 3, 7, 7 and 56 countries, respectively.

Table 2: Least Square Means for Group Medals

Group                y1 (Gold)  y2 (Silver)  y3 (Bronze)  y4 (Total)
1 (USA, RUS)         31.00      33.00        33.50        97.50
2 (CHN, AUS, GER)    21.00      16.33        16.00        53.33
3                    10.43       8.86 #      11.00        30.29
4                     4.86       7.00 #       5.29        17.14
5                     1.23       1.34         1.75         4.32

Note: # the silver means of groups 3 and 4 are close to each other.

Multivariate Analysis of Variance (MANOVA): compares the medal means for the 5 groups

    proc glm;
      class group;
      model y1-y4=group;
      manova h=group;
      lsmeans group/pdiff;
    run;

MANOVA Test: Hypothesis of No Overall Group Effect

Statistic        Value        F Value   Pr > F
Wilks' Lambda    0.02126952   49.34     <.0001

Least Squares Means for effect group for silver (y2)
Pr > |t| for H0: LSMean(i)=LSMean(j)

i/j   2        3        4        5
1     <.0001   <.0001   <.0001   <.0001
2              <.0001   <.0001   <.0001
3                       0.0572   <.0001
4                                <.0001

Note: p-values for the other medals are all < 0.0001.
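The Wilks' Lambda reported in the MANOVA test is the ratio $|E| / |E + H|$ defined earlier in the talk. The sketch below shows that arithmetic in Python on a tiny made-up data set (two groups, two response variables). The function names and the numbers are assumptions for illustration only. It does not reproduce the SAS `proc glm` output or the value 0.0213, which requires the full 75-country data.

```python
# Sketch only: toy data and hypothetical helper names, pure standard library.

def mean_vector(rows):
    """Column means of a list of observation rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sscp(rows, center):
    """Sums of squares and cross-products of the rows about a center vector."""
    p = len(center)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [v - c for v, c in zip(row, center)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s

def det(m):
    """Determinant by cofactor expansion; adequate for the small matrices used here."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, v in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * v * det(minor)
    return total

def wilks_lambda(groups):
    """Return |E| / |E + H| for a list of groups of observation rows."""
    all_rows = [row for g in groups for row in g]
    p = len(all_rows[0])
    # E: within-group (error) SSCP, pooled over groups.
    e = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = sscp(g, mean_vector(g))
        e = [[e[i][j] + s[i][j] for j in range(p)] for i in range(p)]
    # E + H equals the total SSCP about the grand mean.
    t = sscp(all_rows, mean_vector(all_rows))
    return det(e) / det(t)

# Toy data (assumption): two groups, three observations each, two responses.
groups = [
    [[1, 2], [2, 1], [3, 3]],
    [[5, 6], [6, 5], [7, 7]],
]
lam = wilks_lambda(groups)
print(lam)  # 12 / 108, about 0.111; values near 0 indicate well-separated group means
```

The sketch uses the identity that the total sums of squares and cross-products matrix equals E + H, so H never has to be formed explicitly.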
Why?
• Why did some countries win more medals and others fewer?
• Hypothesis: the larger the population and GDP, the more medals.
  Population (x1): the larger the population, the more outstanding athletes are available.
  GDP (Gross Domestic Product, x2): the higher the GDP, the more funding for athletes' training.

Table 3: Multivariate Regression of Medals on Population (x1) [5] and GDP (x2) [6]

    proc glm;
      model y1-y4 = x1-x2/xpx i;
    run;

                  Number of Gold (y1)   Number of Silver (y2)   Number of Bronze (y3)   Number of Total (y4)
x's               β1       p-value      β2       p-value        β3       p-value        β4       p-value
x1 (million)      0.0116   0.0002       0.0043   0.1223         0.0031   0.3712         0.0190   0.0317
x2 ($billion)     0.0031   <.0001       0.0033   <.0001         0.0027   <.0001         0.0091   <.0001

Conclusions
The 2004 Summer Olympic Games are analyzed using multivariate methods: Cluster Analysis, Multivariate Analysis of Variance, and Multivariate Regression Analysis.
Participating countries are classified into 5 groups based on the number of medals they won.
The groups differ significantly in the number of medals won.
Population and GDP are significant factors for the number of medals:
• An increase of 1 million in population increases the number of gold medals by 0.0116 and the number of total medals by 0.019; for example, 100 million more people corresponds to about 1.16 more gold medals. The population effects on silver and bronze are not significant at the 5% level (Table 3).
• An increase of $1 billion in GDP increases the number of gold medals by 0.0031, silver by 0.0033, bronze by 0.0027, and the total by 0.0091.

References
[1] Edward M. Condon, Bruce L. Golden and Edward A. Wasil (1999). Predicting the success of nations at the Summer Olympics using neural networks. Computers & Operations Research. 26(13), 1243-1265.
[2] Marcos P. Estellita Lins, Eliane G. Gomes, João Carlos C. B. Soares de Mello and Adelino José R. Soares de Mello (2003). Olympic ranking based on a zero sum gains DEA model. European Journal of Operational Research. 148(2), 312-322.
[3] L. Churilov and A. Flitman (2004). Towards fair ranking of Olympics achievements: the case of Sydney 2000. Computers & Operations Research.
Available online 6 November 2004.
[4] http://www.athens2004.com/en/OlympicMedals/medals, accessed May 11, 2005.
[5] http://www.geohive.com/global/index.php, accessed Nov. 25, 2004.
[6] http://www.geohive.com/global/geo.php?xml=ec_gdp1&xsl=ec_gdp1, accessed May 11, 2005.

Appendix 1

Table 1. Number of medals for each country/region

Country/Region        Gold  Silver  Bronze  Total
USA                   35    39      29      103
CHN                   32    17      14      63
RUS                   27    27      38      92
AUS                   17    16      16      49
JPN                   16    9       12      37
GER                   14    16      18      48
FRA                   11    9       13      33
ITA                   10    11      11      32
KOR                   9     12      9       30
GBR                   9     9       12      30
CUB                   9     7       11      27
UKR                   9     5       9       23
HUN                   8     6       3       17
ROM                   8     5       6       19
GRE                   6     6       4       16
NOR                   5     0       1       6
NED                   4     9       9       22
BRA                   4     3       3       10
SWE                   4     1       2       7
ESP                   3     11      5       19
CAN                   3     6       3       12
TUR                   3     3       4       10
POL                   3     2       5       10
NZL                   3     2       0       5
THA Thailand          3     1       4       8
BLR Belarus           2     6       7       15
AUT Austria           2     4       1       7
ETH Ethiopia          2     3       2       7
IRI I. R. Iran        2     2       2       6
SVK Slovakia          2     2       2       6
TPE Chinese Taipei    2     2       1       5
GEO Georgia           2     2       0       4
BUL Bulgaria          2     1       9       12
JAM Jamaica           2     1       2       5
UZB Uzbekistan        2     1       2       5
MAR Morocco           2     1       0       3
DEN Denmark           2     0       6       8
ARG Argentina         2     0       4       6
CHI Chile             2     0       1       3
KAZ Kazakhstan        1     4       3       8
KEN Kenya             1     4       2       7
CZE Czech Republic    1     3       4       8
RSA South Africa      1     3       2       6
CRO Croatia           1     2       2       5
LTU Lithuania         1     2       0       3
EGY Egypt             1     1       3       5
SUI Switzerland       1     1       3       5
INA Indonesia         1     1       2       4
ZIM Zimbabwe          1     1       1       3
AZE Azerbaijan        1     0       4       5
BEL Belgium           1     0       2       3
BAH Bahamas           1     0       1       2
ISR Israel            1     0       1       2
CMR Cameroon          1     0       0       1
DOM Dominican Rep     1     0       0       1
IRL Ireland           1     0       0       1
UAE U Arab Emirates   1     0       0       1
PRK DPR Korea         0     4       1       5
LAT Latvia            0     4       0       4
MEX Mexico            0     3       1       4
POR Portugal          0     2       1       3
FIN Finland           0     2       0       2
SCG Serbia.Monteneg   0     2       0       2
SLO Slovenia          0     1       3       4
EST Estonia           0     1       2       3
HKG Hong Kong         0     1       0       1
IND India             0     1       0       1
PAR Paraguay          0     1       0       1
NGR Nigeria           0     0       2       2
VEN Venezuela         0     0       2       2
COL Colombia          0     0       1       1
ERI Eritrea           0     0       1       1
MGL Mongolia          0     0       1       1
SYR Syrian Arab Rep   0     0       1       1
TRI Trinidad.Tobago   0     0       1       1

SAS coding-1

    data Anthemn2004SummerOlympic;
      input Country $ y1-y4;
      cards;
    see Table 1 for data
    ;
    /* cluster the countries on their four medal counts */
    proc cluster method=eml standard rmsstd rsquare outtree=tree;
      var y1-y4;
      id country;
    run;
    /* cut the tree into 5 clusters and save the assignments */
    proc tree data=tree noprint n=5 out=countryout;
      id country;
    run;
    /* draw the tree */
    proc tree data=tree n=5;
      id country;
    run;
    /* name the data set explicitly instead of relying on the most recently created one */
    proc sort data=countryout;
      by country;
    proc sort data=Anthemn2004SummerOlympic out=new;
      by country;
    data temp;
      merge new countryout;
      by country;
    proc sort;
      by cluster;
    proc print;
      id country;
    /* ROTATE= takes a single method; the slide listed "varimax, quartimax",
       and quartimax is the rotation reported in the factor analysis table */
    proc factor heywood rotate=quartimax;
      var y1-y4;
      by cluster;
    proc princomp;
      var y1-y4;
    run;
    proc factor heywood rotate=quartimax;
      var y1-y4;
    run;

SAS coding-2

    data Anthemn2004SummerOlympic;
      input group y1-y4 x1-x2;
      cards;
    5 35 39 29 103 273 10882
    5 27 27 38 92 146 433
    4 32 17 14 63 1247 1410
    4 17 16 16 49 19 518
    4 14 16 18 48 82 2401
    ;
    proc glm;
      class group;
      model y1-y4=group;
      manova h=group/printe printh;
      lsmeans group/pdiff;
    run;

SAS coding-3

    data Anthemn2004SummerOlympic;
      input group y1-y4 x1-x2;
      cards;
    ……………..
    ;
    proc corr;
      var y1-y4 x1-x2;
    run;
    proc glm;
      model y1-y4 = x1-x2/xpx i;
      manova h=x1 x2/printe printh;
    run;

Cluster analysis: Countries Classified into 5 Groups
[Figure: dendrogram of the countries, cut into groups 5, 4, 3, 2 and 1; vertical axis: log likelihood; horizontal axis: country]
Table 2: Factor Analysis on Medals

Group             1          2          3                    4                    5
Latent factor     1          1          1         2          1         2          1         2
(%) *             (94.71)    (95.99)    (61.35)   (86.50)    (52.10)   (83.89)    (58.04)   (83.09)
Gold (y1)         0.9634 #   0.9997     0.8783    -.0055     -.0378    0.9844     0.7654    0.0052
Silver (y2)       0.9694     0.9839     0.1470    0.9873     0.6388    -.5449     0.1409    0.9893
Bronze (y3)       0.9595     -.9414     0.8314    -.0629     0.8151    -.1716     0.8546    -.1100
Total (y4)        0.9999     0.9928     0.8682    0.4932     0.9551    0.2727     0.9186    0.3908

Note:
* Cumulative eigenvalues: percentage of the total variation in the four variables (medals) explained.
# Factor loading: correlation between the latent factor and the variable (Factor Analysis, rotation = quartimax, which makes each latent factor either strongly or weakly correlated with the variables).

Correlation Between y's and x's
x1 (# Population) [5], x2 (# GDP, Gross Domestic Product) [6]

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0

       x1                      x2
       r         Prob > |r|    r         Prob > |r|
y1     0.46543   <.0001        0.70219   <.0001
y2     0.3038    0.0081        0.76180   <.0001
y3     0.23199   0.0452        0.60769   <.0001
y4     0.34887   0.0022        0.71640   <.0001

Note: moderate correlation between the y's and x1, large correlation between the y's and x2.
# Both population and GDP are 2003 figures.
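The "Prob > |r|" entries on the correlation slide test H0: Rho=0, which is conventionally done with a t statistic on n - 2 degrees of freedom. The Python sketch below shows how a Pearson coefficient and that t statistic are computed. The function names and the three-point toy series are made up for illustration, and n = 75 is an assumption based on the 75 medal-winning countries/regions mentioned in the talk.

```python
import math

# Sketch only: hypothetical helper names and toy data.

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, referred to a t distribution with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Toy series (assumption), chosen so the result is easy to check by hand.
r_toy = pearson_r([1, 2, 3], [1, 3, 2])
print(r_toy)  # 0.5

# Correlation between bronze medals (y3) and population (x1) from the slide,
# assuming all n = 75 medal-winning countries entered the calculation.
t_y3_x1 = t_statistic(0.23199, 75)
print(round(t_y3_x1, 2))  # about 2.04
```

Under the n = 75 assumption, a t value of about 2.04 sits just above the two-sided 5% critical value of roughly 1.99 for 73 degrees of freedom, which is in line with the reported p-value of 0.0452.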