A Multivariate Analysis on the 2004
Summer Olympic Games
Wei Xiong, M.Sc Student,
Department of Mathematics and Statistics,
University of Guelph
May 12-13, 2005
OUTLINE
1. Introduction
• 2004 Summer Olympic Games
• Multivariate techniques: cluster analysis, multivariate analysis of variance, multivariate regression analysis
• Literature review of analyses on Olympic Games
2. Data Analysis and Discussion
3. Conclusions

2004 Summer Olympic Games
• The largest event: 11,000 athletes from 202 countries; 929 medals won by 75 countries/regions.

Multivariate (>1 response variable) Techniques
• Cluster Analysis: observations (countries) are classified into clusters (groups) according to their similarity on several variables (number of gold, silver, bronze and total medals), by measuring the distance or dissimilarity between any two clusters.
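To make the idea of distance concrete, the following Python sketch computes the Euclidean distance between the medal vectors (gold, silver, bronze, total) of three countries from Table 1. Euclidean distance is an assumption chosen for illustration; the clustering method used in the SAS code may measure dissimilarity differently.

```python
import math

# Medal vectors (gold, silver, bronze, total) taken from Table 1.
medals = {
    "USA": (35, 39, 29, 103),
    "Russia": (27, 27, 38, 92),
    "China": (32, 17, 14, 63),
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_usa_rus = euclidean(medals["USA"], medals["Russia"])
d_usa_chn = euclidean(medals["USA"], medals["China"])

# A smaller distance means more similar medal profiles.
print(f"USA-Russia: {d_usa_rus:.2f}")
print(f"USA-China:  {d_usa_chn:.2f}")
```

Under this measure the USA is closer to Russia than to China, which agrees with the USA and Russia sharing cluster 1 in Table 1.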
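• Multivariate Analysis of Variance (MANOVA): a generalization of ANOVA, used to compare the mean vectors of two or more populations
Hypothesis:
$H_0: \mu_1 = \dots = \mu_t$
versus
$H_a: \mu_j \neq \mu_k$ (for some $j \neq k$)
$H_0$ is rejected if H = SS(Treatment) is large relative to E = SS(Error), where H and E are the treatment and error sums-of-squares-and-cross-products matrices.
Wilks’ statistic: $\Lambda = |E| / |E + H|$ (small values of $\Lambda$ lead to rejection of $H_0$)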
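As a small illustration of the statistic, the Python sketch below computes E, H and Wilks' Lambda for only two response variables (gold and silver) and only the two top groups, using the group membership from Table 2 and the medal counts from Appendix 1. This reduced setup is an assumption made to keep the determinants 2x2; it is not the full five-group, four-variable analysis reported in the slides.

```python
# Gold (y1) and silver (y2) counts for the two top groups in Table 2.
groups = {
    1: [(35, 39), (27, 27)],            # USA, RUS
    2: [(32, 17), (17, 16), (14, 16)],  # CHN, AUS, GER
}

def mean_vector(rows):
    """Mean of each of the two response variables."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def add_outer(matrix, dev, weight=1.0):
    """Add weight * dev dev' to a 2x2 matrix in place."""
    for a in range(2):
        for b in range(2):
            matrix[a][b] += weight * dev[a] * dev[b]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

all_rows = [r for rows in groups.values() for r in rows]
grand_mean = mean_vector(all_rows)

E = [[0.0, 0.0], [0.0, 0.0]]  # within-group (error) SSCP matrix
H = [[0.0, 0.0], [0.0, 0.0]]  # between-group (treatment) SSCP matrix
for rows in groups.values():
    m = mean_vector(rows)
    for r in rows:
        add_outer(E, [r[k] - m[k] for k in range(2)])
    add_outer(H, [m[k] - grand_mean[k] for k in range(2)], weight=len(rows))

E_plus_H = [[E[a][b] + H[a][b] for b in range(2)] for a in range(2)]
wilks_lambda = det2(E) / det2(E_plus_H)
print(f"Wilks' Lambda = {wilks_lambda:.4f}")
```

For these five countries the value is about 0.18. Lambda always lies between 0 and 1, and values near 0 mean that the between-group variation dominates the within-group variation.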
• Multivariate Regression
model: $Y_{n \times p} = X_{n \times q}\,\beta_{q \times p} + E_{n \times p}$
where n: observations,
p: response variables,
q: explanatory variables
Least squares estimator of $\beta$ is:
$\hat{\beta} = (X'X)^{-1} X'Y$
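The estimator can be sketched in plain Python as follows. The data and the "true" coefficients are hypothetical toy values chosen only so that the result can be checked, and the helper functions are illustrative names, not part of the original analysis. The sketch solves the normal equations (X'X) beta = X'Y rather than forming the inverse explicitly, which gives the same estimate.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Hypothetical toy data: n = 5 observations, q = 3 columns (intercept, x1, x2).
X = [
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 5.0],
    [1.0, 4.0, 3.0],
    [1.0, 5.0, 4.0],
]
# Hypothetical coefficients (q x p, with p = 2 responses), used only to build Y.
beta_true = [
    [1.0, 0.0],
    [2.0, 0.5],
    [0.5, 3.0],
]
Y = matmul(X, beta_true)  # noise-free, so the estimate should recover beta_true

Xt = transpose(X)
beta_hat = solve(matmul(Xt, X), matmul(Xt, Y))  # solves (X'X) beta = X'Y
for row in beta_hat:
    print([round(v, 6) for v in row])
```

Each column of the estimate is the ordinary least squares fit for one response, so the multivariate estimator amounts to p separate regressions that share the same X.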

Literature review
• Condon et al. [1] tried to predict a country’s success at the Olympic Games using linear regression models and neural network models.
• Lins et al. [2] developed a Data Envelopment Analysis (DEA)-based model to rank each country based on its ability to win medals in relation to its available resources.
• Churilov and Flitman [3] improved the DEA-based model by combining different sets of input parameters with the DEA model.

This study uses multivariate techniques to analyze the 2004 Summer Olympic Games and tries to explore the factors that influence the number of medals won.
Table 1: Rankings For Participating Countries

Country    Gold (y1)  Silver (y2)  Bronze (y3)  Total (y4)  Ranking (by Gold) [4]  Ranking (by Cluster Analysis)
USA        35         39           29           103         1                      1
China      32         17           14           63          2                      2
Russia     27         27           38           92          3                      1
Canada     3          6            3            12          21                     4
Syria      0          0            1            1           71                     5
Trinidad   0          0            1            1           71                     5

Note: the numbers of countries in clusters 1, 2, 3, 4 and 5 are 2, 3, 7, 7 and 56, respectively.
Table 2: Least Squares Means of Medals by Group

Group               y1 (Gold)  y2 (Silver)  y3 (Bronze)  y4 (Total)
1 (USA, RUS)        31.00      33.00        33.50        97.50
2 (CHN, AUS, GER)   21.00      16.33        16.00        53.33
3                   10.43      8.86 #       11.00        30.29
4                   4.86       7.00 #       5.29         17.14
5                   1.23       1.34         1.75         4.32

Note: # close to each other
Multivariate Analysis of Variance (MANOVA):
Compares the medal means for the 5 groups
proc glm;
class group;
model y1-y4=group;
manova h=group;        /* multivariate test of the group effect */
lsmeans group/pdiff;   /* pairwise comparisons of group means */
run;

MANOVA Test: Hypothesis of No Overall Group Effect
Statistic       Value        F Value   Pr > F
Wilks' Lambda   0.02126952   49.34     <.0001
Least Squares Means for effect group for silver (y2)
Pr > |t| for H0: LSMean(i)=LSMean(j)

i/j   2        3        4        5
1     <.0001   <.0001   <.0001   <.0001
2              <.0001   <.0001   <.0001
3                       0.0572   <.0001
4                                <.0001

Note: p-values for the other medals < 0.0001
WHY?
• Why did some countries win more medals than others?
• Hypothesis: the larger the population and GDP, the more medals won
Population: the larger the population (x1), the more outstanding athletes available
GDP (Gross Domestic Product): the higher the GDP, the more funding for athletes’ training
Table 3: Multivariate Regression of Medals on Population (x1) [5] and GDP (x2) [6]
proc glm;
model y1-y4 = x1-x2/xpx i;
run;

y’s              Number of Gold (y1)   Number of Silver (y2)   Number of Bronze (y3)   Number of Total (y4)
x’s              β1       p-value      β2       p-value        β3       p-value        β4       p-value
x1 (million)     0.0116   0.0002       0.0043   0.1223         0.0031   0.3712         0.0190   0.0317
x2 ($billion)    0.0031   <.0001       0.0033   <.0001         0.0027   <.0001         0.0091   <.0001
Conclusions
• The 2004 Summer Olympic Games are analyzed using multivariate methods: Cluster Analysis, Multivariate Analysis of Variance and Multivariate Regression Analysis.
• Participating countries are classified into 5 groups based on the number of medals won. The groups differ significantly in their mean numbers of medals.
• Population and GDP are two significant factors for the number of medals won: an increase of 1 million in population increases the number of gold medals by 0.0116 and the total number of medals by 0.019 (the population effect on silver and bronze is not significant at the 5% level; see Table 3). An increase of $1 billion in GDP increases the number of gold medals by 0.0031, silver by 0.0033, bronze by 0.0027 and the total by 0.0091.
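The coefficients above can be read as simple linear arithmetic. The Python sketch below applies the reported slopes for gold and total medals to a hypothetical scenario; the scenario values and the function name are illustrative assumptions, and only changes are computed because the slides do not report the intercepts.

```python
# Slope estimates reported in Table 3
# (medals per million people, medals per $billion of GDP).
coef = {
    "gold":  {"population": 0.0116, "gdp": 0.0031},
    "total": {"population": 0.0190, "gdp": 0.0091},
}

def expected_change(medal, d_population_million=0.0, d_gdp_billion=0.0):
    """Change in the fitted medal count for given changes in the predictors."""
    c = coef[medal]
    return c["population"] * d_population_million + c["gdp"] * d_gdp_billion

# Hypothetical scenario: +100 million people and +$1,000 billion of GDP.
print(round(expected_change("gold", 100, 1000), 2))
print(round(expected_change("total", 100, 1000), 2))
```

In this scenario the fitted change is about 4.26 gold medals and 11 medals in total, with most of the change coming from the GDP term.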
References
[1] Edward M. Condon, Bruce L. Golden and Edward A. Wasil (1999). Predicting the success of nations at the Summer Olympics using neural networks. Computers & Operations Research, 26(13), 1243-1265.
[2] Marcos P. Estellita Lins, Eliane G. Gomes, João Carlos C. B. Soares de Mello and Adelino José R. Soares de Mello (2003). Olympic ranking based on a zero sum gains DEA model. European Journal of Operational Research, 148(2), 312-322.
[3] L. Churilov and A. Flitman (2004). Towards fair ranking of Olympics achievements: the case of Sydney 2000. Computers & Operations Research. Available online 6 November 2004.
[4] http://www.athens2004.com/en/OlympicMedals/medals, accessed May 11, 2005.
[5] http://www.geohive.com/global/index.php, accessed Nov. 25, 2004.
[6] http://www.geohive.com/global/geo.php?xml=ec_gdp1&xsl=ec_gdp1, accessed May 11, 2005.
Appendix 1
Table 1. Number of medals for each country/region
Country/Region,Gold,Silver,Bronze,Total
USA,35,39,29,103
CHN,32,17,14,63
RUS,27,27,38,92
AUS,17,16,16,49
JPN,16,9,12,37
GER,14,16,18,48
FRA,11,9,13,33
ITA,10,11,11,32
KOR,9,12,9,30
GBR,9,9,12,30
CUB,9,7,11,27
UKR,9,5,9,23
HUN,8,6,3,17
ROM,8,5,6,19
GRE,6,6,4,16
NOR,5,0,1,6
NED,4,9,9,22
BRA,4,3,3,10
SWE,4,1,2,7
ESP,3,11,5,19
CAN,3,6,3,12
TUR,3,3,4,10
POL,3,2,5,10
NZL,3,2,0,5
THA Thailand,3,1,4,8
BLR Belarus,2,6,7,15
AUT Austria,2,4,1,7
ETH Ethiopia,2,3,2,7
IRI I.R. Iran,2,2,2,6
SVK Slovakia,2,2,2,6
TPE Chinese Taipei,2,2,1,5
GEO Georgia,2,2,0,4
BUL Bulgaria,2,1,9,12
JAM Jamaica,2,1,2,5
UZB Uzbekistan,2,1,2,5
MAR Morocco,2,1,0,3
DEN Denmark,2,0,6,8
ARG Argentina,2,0,4,6
CHI Chile,2,0,1,3
KAZ Kazakhstan,1,4,3,8
KEN Kenya,1,4,2,7
CZE Czech Republic,1,3,4,8
RSA South Africa,1,3,2,6
CRO Croatia,1,2,2,5
LTU Lithuania,1,2,0,3
EGY Egypt,1,1,3,5
SUI Switzerland,1,1,3,5
INA Indonesia,1,1,2,4
ZIM Zimbabwe,1,1,1,3
AZE Azerbaijan,1,0,4,5
BEL Belgium,1,0,2,3
BAH Bahamas,1,0,1,2
ISR Israel,1,0,1,2
CMR Cameroon,1,0,0,1
DOM Dominican Rep,1,0,0,1
IRL Ireland,1,0,0,1
UAE U Arab Emirates,1,0,0,1
PRK DPR Korea,0,4,1,5
LAT Latvia,0,4,0,4
MEX Mexico,0,3,1,4
POR Portugal,0,2,1,3
FIN Finland,0,2,0,2
SCG Serbia.Monteneg,0,2,0,2
SLO Slovenia,0,1,3,4
EST Estonia,0,1,2,3
HKG Hong Kong,0,1,0,1
IND India,0,1,0,1
PAR Paraguay,0,1,0,1
NGR Nigeria,0,0,2,2
VEN Venezuela,0,0,2,2
COL Colombia,0,0,1,1
ERI Eritrea,0,0,1,1
MGL Mongolia,0,0,1,1
SYR Syrian Arab Rep,0,0,1,1
TRI Trinidad.Tobago,0,0,1,1
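SAS coding-1
/* Read the medal counts: y1=gold, y2=silver, y3=bronze, y4=total */
data Anthemn2004SummerOlympic;
input Country $ y1-y4;
cards;
see Table 1 for data
;
/* Hierarchical clustering of countries on standardized medal counts */
proc cluster method=eml standard rmsstd rsquare outtree=tree;
var y1-y4 ;
id country;
run;
/* Cut the tree into 5 clusters and save each country's cluster */
proc tree data=tree noprint n=5 out=countryout;
id country;
run;
/* Draw the dendrogram */
proc tree data=tree n=5;
id country;
run;
/* Merge cluster membership back onto the medal data */
proc sort data=countryout;
by country;
run;
proc sort data=Anthemn2004SummerOlympic out=new;
by country;
run;
data temp;
merge new countryout;
by country;
run;
proc sort data=temp;
by cluster;
run;
proc print data=temp;
id country;
run;
/* Factor analysis within each cluster (ROTATE= accepts a single method) */
proc factor data=temp heywood rotate=quartimax;
var y1-y4 ;
by cluster;
run;
/* Principal components and factor analysis over all countries */
proc princomp data=temp;
var y1-y4 ;
run;
proc factor data=temp heywood rotate=quartimax;
var y1-y4 ;
run;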
SAS coding-2
data Anthemn2004SummerOlympic;
input group y1-y4 x1-x2;
cards;
5 35 39 29 103 273 10882
5 27 27 38 92 146 433
4 32 17 14 63 1247 1410
4 17 16 16 49 19 518
4 14 16 18 48 82 2401
;
proc glm;
class group;
model y1-y4=group;
manova h=group/printe printh;
lsmeans group/pdiff;
run;
SAS coding
data Anthemn2004SummerOlympic;
input group y1-y4 x1-x2 ;
cards;
……………..
;
proc corr;
var y1-y4 x1-x2;
run;
proc glm;
model y1-y4 = x1-x2/xpx i;
MANOVA H=x1 x2 /printe printh;
run;
Cluster analysis: Countries Classified into 5 Groups
[Figure: dendrogram of the countries cut into 5 groups (labelled 1 to 5); vertical axis: log likelihood; horizontal axis: country.]
Table 2: Factor Analysis on Medals
Group
1
2
Latent
Factor
1
1
1
2
(94.71) *
(95.99)
(61.35)
Gold
(y1)
0.9634 #
0.9997
Silver
(y2)
0.9694
Bronze
(y3)
Total
(y4)
(%)
3
4
5
(86.50)
1
(52.10)
2
(83.89)
1
(58.04)
2
(83.09)
0.8783
-.0055
-.0378
0.9844
0.7654
0.0052
0.9839
0.1470
0.9873
0.6388
-.5449
0.1409
0.9893
0.9595
- .9414
0.8314
-.0629
0.8151
-.1716
0.8546
-.1100
0.9999
0.9928
0.8682
0.4932
0.9551
0.2727
0.9186
0.3908
Note: * Cumulative eigenvalue: cumulative percentage of the total variation in the four variables (medals) explained.
# Factor loading: correlation between the latent factor and the variable.
(Factor analysis with quartimax rotation, which makes each latent factor either strongly or weakly correlated with each variable.)
Correlation Between y’s and x’s
x1 (population #) [5], x2 (GDP #, Gross Domestic Product) [6]
Pearson Correlation Coefficients (Prob > |r| under H0: Rho=0 in parentheses)

      x1                  x2
y1    0.46543 (<.0001)    0.70219 (<.0001)
y2    0.3038  (0.0081)    0.76180 (<.0001)
y3    0.23199 (0.0452)    0.60769 (<.0001)
y4    0.34887 (0.0022)    0.71640 (<.0001)

Note: moderate correlation between y’s and x1, large correlation between y’s and x2.
# Both population and GDP are 2003 figures.