Using Clustering to
Develop a College
Football Ranking System
Fall 2005
ECE 539
Final Project
Joseph Detmer
Abstract
This project undertook the task of creating a college football ranking system
based purely on statistics. The data was obtained from two impartial websites.
The algorithm (explained in greater detail below) first clusters the data and then
feeds the clustered values into an equation from which the final rank results. A
reasonable system was created: the rankings produced from the test data were
very close to those of published polls.
Introduction
Being a former soccer player, I have long been wary of embracing the game of
football. The game seemed back and forth, long, and with a lot of time between
the actual “action.” High school football is not all that entertaining, and I was
never really able to embrace a professional or collegiate team, feeling too
distant from any particular one. Then I came to college. Games were much
more exciting: the number and enthusiasm of the fans, the size of the athletes,
and the skill of the players were all far greater. Not many other things can
make thousands of people wake up early on a cool Saturday morning in the fall
to sit in a parking lot grilling brats for several hours. While professional football
is extremely exciting, college football is not far off, and is held in higher regard
by some. Whether it is an unranked team upsetting a highly ranked opponent, a
national title game, or an alma mater playing its always-hated arch-rival,
college football entertains millions.
At the end of each season, any college football team that has at least 6 wins is
eligible for a bowl. Since 1998, a computer system has been in place to
determine the top college football teams in the country. The initial purpose of the
system was to match “equal” teams up to play in exciting contests for more than
just the national championship game. This system brought the top four bowl
games together, letting each have a turn for the national title game. These bowls
initially had been reserved for the champions of the best conferences. This new
system allowed a good team from a somewhat weaker conference to play in a
big game at the end of the season.
Any system created to predict how good a team is, or who will win a game, will
eventually fail. There are too many factors that cannot be taken into account,
such as star players being injured, an inspired strategy developed by a coaching
staff, or simply a team having an off day. This project therefore determines
which team is the best on average.
Motivation
The most difficult question is: how do you quantify how “good” a team is? Is it
their record? Is it how many points they score, their total yards of offense, or
the turnovers they create? If one were to look up statistics for college football
teams, one could find any kind of stat one could ever want. For example, one
could find the rankings of all teams' average yardage on 1st down in the 3rd
quarter. However, such information is too specific for the question of how
“good” a team is. Most experts would agree that many factors can be combined
into a quantitative representation of how good a team is. The BCS system uses
several computer models combined with polls to determine its rankings. I will
create a system using only a computer model.
In this project, it was determined that how good a team is would be measured
from a small subset of general data. These data points do not directly
correspond to inputs to the system. Each data set is first clustered into several
clusters. Data from one year is then used to create a function from the
clustered data. This function is applied to a second year's (2004) data, which
forms the test set.
Data Collection
The first and most difficult part of the project was getting the necessary data.
The first decision was which statistics are most influential in how good a team
is. As noted above, there is a huge number of statistics that could be retrieved.
Which ones should be chosen? We want a large enough set of data to get good
results, but not so large that the statistics become redundant and the data
difficult to obtain. The problem was initially broken up into pieces:

- Offense
- Defense
- Special Teams & Turnovers
- Record & Strength of Schedule

These four pieces can be handled more easily than the whole together.
Offense:
A good offense is integral to how good a team is. If a team is never able to score
points, it will never win a game. It is not unusual for a defense to score points,
but to count on it for the entire point production of a football team would be
suicide. So how do we decide how good an offense is? Rushing yards have
proven to be a very important part of a football team: if a team cannot run the
ball consistently, it is difficult to tire a defense. Passing yardage is also extremely
important; passing can produce quick points, or keep the ball in your possession
late in the game. However, a balanced attack is truly the key to a good offense.
While teams that have either a solid running game or a solid passing game can
sometimes be effective, a team with both is much better. Having a good rushing
game and a good passing game keeps the defense guessing what will come
next. Finally, we must remember that yards mean nothing if a team does not
score, so offensive scoring is an integral part. This leaves us with 4 data sets
for offense:

- Rushing yardage
- Passing yardage
- Total offensive yardage
- Offensive scoring
Defense
The saying is “Offense wins games, defense wins championships.” But how do
we determine how good a defense is? Most would say: the exact opposite of
determining how good an offense is. Therefore, for the same reasons as in the
offense section, our data sets for defense are:

- Rushing yardage allowed
- Passing yardage allowed
- Total yardage allowed
- Total defensive points allowed
Special Teams & Turnovers
While offense and defense are a huge part of football, games often come down
to special teams. If neither team can produce yards on offense, the team with
better field position is more likely to win. I was hoping to find a statistic for
average starting field position; however, after a tedious search provided nothing
of the sort, I opted for statistics that could approximate it. Finally, we cannot
forget turnovers. Oftentimes a single turnover changes the course of a game.
There are many different types (fumbles and interceptions, both offensive and
defensive), but do we really need to take all of these into account? If a team
creates many turnovers but gives up just as many, it does not really gain
anything. Therefore, turnover margin is included in our data set. This brings
the data collected in this section to the following sets:

- Turnover margin
- Net punting yardage
- Punt return yardage

- Kick return yardage
- Kick return defensive yardage allowed
Record & Strength of Schedule
The single most important thing taken into account when determining how good
a football team is, is its record. The whole purpose of a football game is to see
who will win. If a team loses, it is very difficult to argue that that team is better.
Statistics such as how many yards a team gains or how many turnovers it
creates are an important part of determining how good a team is; however, the
most important part should be the team's record. A good team will rarely get
beat; a poor team will rarely win. There is one more important statistic that
should be taken into account, though. If a team has few losses but has played
no “good” teams, while another team with a couple of losses has played only
“good” teams, which is better if the two have not played each other? Strength
of schedule is calculated as two parts the record of your opponents and one
part the record of your opponents' opponents, which yields a decimal between
0 and 1. Our last two data sets are:

- Record
- Strength of schedule
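The two-parts/one-part weighting just described can be sketched as follows. This is an illustrative helper (the function name and sample percentages are mine, not from the project's scripts):

```python
def strength_of_schedule(opp_win_pct, opp_opp_win_pct):
    """Strength of schedule as described above: two parts opponents'
    record, one part opponents' opponents' record. Both inputs are
    winning percentages in [0, 1], so the result is also in [0, 1]."""
    return (2 * opp_win_pct + opp_opp_win_pct) / 3

# Hypothetical numbers: opponents won 60% of their games, and those
# opponents' opponents won 45% of theirs.
sos = strength_of_schedule(0.60, 0.45)
```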
As stated above, many websites have statistics on college football. Some have
very little data; some have quite a few statistics, but few useful ones. In the end
I chose two websites that contained all the data I was looking for. Since the
pages differ, the two sites each required a unique way of parsing the .htm files
for the necessary data. One reason I chose these two sites was that they kept
archived data, so I was able to obtain data from last year as well as this year.
All data except strength of schedule was gathered from the first site. The data
was gathered from the following two sites:
http://web1.ncaa.org/d1mfb/natlRank.jsp?div=4&site=org
http://www.warrennolan.com/football/2005/sos
One final site was used to classify the training data. I was only able to find one
source that ranked all 119 teams; most polls and rankings cover only the top
25. This site did not archive data, so I was only able to obtain this ranking for
the current year.
http://cbs.sportsline.com/collegefootball/polls/119/index1
I developed two Perl scripts to parse the data. They are essentially the same,
but slightly modified for the two cases. Data for 2005 was used as the training
set, and data for 2004 was used as the testing set. It should be noted that
these scripts were developed after downloading the files on Sun Solaris
machines; problems occurred when downloading with Internet Explorer on
Windows, as the web pages seem to be saved in a different format. If you wish
to recreate this project, proceed with caution. It should also be noted that all
directories must be created before any commands are executed: the scripts do
not check whether directories or files exist before using them. The user only
needs to make sure all directories are created and the initial .htm files exist
and are named correctly. All Matlab execution was done on a Windows
platform. The two Perl files are named extract_stuff.pl and
2004extract_stuff.pl. They output files that contain the team name as well as
the specific data set.
Data Manipulation
The data sets created by the two initial Perl scripts are the inputs to our
clustering method, which takes each individual data set and clusters it using
ten clusters. The number of clusters was chosen to be large enough that the
separation between large and small statistics is noticeable, yet small enough
that the clusters are of reasonable size. Each data set was clustered, and each
data point in that set was assigned to a particular cluster. Each cluster was
then given a weight (an integer from 1 to 10) according to the rank of that
cluster, with 1 being the lowest and 10 the highest. Once this occurred, the
new weights were output to a file, where they would be rearranged. The two
files that take care of this clustering are cl.m and cl2004.m.
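The project's clustering was done in Matlab (cl.m); the following is a rough Python sketch of the same idea, assuming a simple 1-D k-means. The function name, iteration count, and random initialization are illustrative, not the project's actual implementation:

```python
import numpy as np

def cluster_weights(values, k=10, iters=50, seed=0):
    """Cluster a 1-D statistic into k clusters with a simple k-means,
    then replace every data point with the rank of its cluster center
    (1 = lowest center, k = highest), as described above."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # initialize centers by sampling k distinct data points
    centers = rng.choice(values, size=k, replace=False)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest cluster center
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # move each center to the mean of its points (empty clusters stay put)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    # weight = rank of the point's cluster center, from 1 (lowest) to k (highest)
    ranks = np.argsort(np.argsort(centers))
    return ranks[labels] + 1
```

A larger statistic always maps to an equal or larger weight, which is the property the ranking equation relies on.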
An example clustering is shown below. The data set being clustered is defensive
pass yardage. The cluster shown in blue contains only the points that belong to
the 2nd cluster; the remaining cluster centers are shown, but their data points
have been removed. The Voronoi lines are drawn to show where one cluster
ends and the next begins.
The reason clustering was chosen as the engine of this operation is its
versatility. From year to year, statistics can vary greatly; in a particularly cold
and rainy season, little offensive production can lead to low statistics. A trained
MLP or fuzzy set would have a much more difficult time dealing with statistics
that fluctuate from year to year than a clustering algorithm does. Rankings
should be created from how good a team is in a particular year, not compared
to other years.
Once the clustering is complete and the output sent to files, two more Perl
scripts reform the data into a single array. This array contains each team's
cluster results on one line, with the corresponding data set in each column.
The 2005 data is also given a rank, which is essentially all the training the
system gets; this will be discussed later. The two files that reform the data are
toFunc.pl and 2004toFunc.pl.
Finally, a single number was calculated for each test point. In order to calculate
this, a function first needed to be created. From the training set (2005 data), we
used a matrix algebra trick. We call our inputs the matrix X, the known rank for
each team Y, and our coefficient matrix A. We know that

A * X = Y
A = Y * X^(-1)

However, since X is not square, we cannot take its true inverse; therefore we
take the pseudo-inverse. Once we have the coefficient matrix, we apply the
coefficients to our test matrix, which gives approximate rankings. The Matlab
file that does this is makefunction.m.
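The pseudo-inverse step in makefunction.m can be sketched in NumPy. The shapes and the synthetic data below are illustrative (the real X holds the cluster weights and Y the 2005 ranks):

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_teams = 15, 119      # e.g., 15 cluster-weight rows, 119 teams

# Each column of X is one team's vector of cluster weights (1..10);
# Y stands in for the known rank value of each team in the training year.
X = rng.integers(1, 11, size=(n_features, n_teams)).astype(float)
Y = rng.normal(size=(1, n_features)) @ X

# A * X = Y  =>  A = Y * pinv(X), since X is not square
A = Y @ np.linalg.pinv(X)

# Applying the coefficients to a second year's data gives approximate ranks
X_test = rng.integers(1, 11, size=(n_features, n_teams)).astype(float)
scores = A @ X_test
```

Because X has far more columns than rows, pinv(X) gives the least-squares coefficients, exactly as Matlab's pinv does.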
The data is then sorted with a final Perl script (getnewranks.pl) and output to
its final file (FINALRANKINGS.TXT).
Results
#1. Southern California - 167.3486
#2. Auburn - 161.4341
#3. Oklahoma - 116.2092
#4. Texas - 112.4908
#5. Miami (Fla.) - 112.4448
#6. Virginia Tech - 111.4653
#7. California - 108.7134
#8. Florida St. - 108.4097
#9. Utah - 107.8165
#10. Louisville - 107.5966
#11. Iowa - 107.5766
#12. Boise St. - 104.7642
#13. Georgia - 100.8888
#14. Bowling Green - 95.4476
#15. Purdue - 91.1031
#16. Virginia - 88.6308
#17. Arizona St. - 88.2669
#18. Texas A&M - 86.3354
#19. Wisconsin - 83.4357
#20. Navy - 83.3217
#21. Fla. Atlantic - 83.0395
#22. Ohio St. - 80.6131
#23. Tennessee - 79.4607
#24. UTEP - 79.2717
#25. Texas Tech - 79.0969
#26. Troy - 77.8624
#27. Pittsburgh - 77.6789
#28. Michigan - 76.7551
#29. Memphis - 75.9943
#30. Georgia Tech - 75.2869
#31. LSU - 74.5189
#32. Miami (Ohio) - 74.3935
#33. Fresno St. - 73.4196
#34. Oregon St. - 73.057
#35. Connecticut - 72.5143
#36. Northern Ill. - 72.2317
#37. Penn St. - 71.3406
#38. Florida - 69.7872
#39. West Virginia - 69.6947
#40. Oklahoma St. - 68.7826
#41. Notre Dame - 68.6967
#42. Minnesota - 67.5152
#43. Hawaii - 66.6198
#44. Arkansas - 64.826
#45. Wyoming - 63.4262
#46. New Mexico - 62.9263
#47. Colorado - 62.323
#48. Toledo - 62.2786
#49. Boston College - 62.1604
#50. North Carolina - 61.1465
#51. Cincinnati - 59.039
#52. South Carolina - 58.1333
#53. Marshall - 56.4976
#54. Arizona - 55.924
#55. Clemson - 55.6617
#56. Brigham Young - 55.4913
#57. TCU - 55.0452
#58. Kansas - 54.9249
#59. UAB - 54.3204
#60. Maryland - 54.1745
#61. Iowa St. - 52.9625
#62. Nebraska - 51.3158
#63. Washington St. - 50.6705
#64. Stanford - 50.27
#65. Oregon - 49.8981
#66. Alabama - 49.7194
#67. Akron - 49.5765
#68. North Carolina St. - 48.8846
#69. Tulane - 48.5183
#70. UCLA - 48.1761
#71. Missouri - 47.5022
#72. Kent St. - 47.1419
#73. Southern Miss. - 46.0345
#74. Northwestern - 45.1281
#75. Middle Tenn. St. - 44.2224
#76. North Texas - 44.1114
#77. Michigan St. - 43.7986
#78. San Diego St. - 41.1538
#79. Syracuse - 40.7939
#80. Louisiana Tech - 40.6867
#81. Nevada - 38.9364
#82. Wake Forest - 38.7285
#83. New Mexico St. - 38.2424
#84. La. Monroe - 36.0539
#85. Rutgers - 34.4375
#86. Kansas St. - 32.1734
#87. Colorado St. - 31.5946
#88. Vanderbilt - 31.0494
#89. Air Force - 30.8899
#90. Central Mich. - 30.5194
#91. Eastern Mich. - 30.2132
#92. Baylor - 29.4173
#93. Tulsa - 28.4209
#94. Ohio - 28.2544
#95. South Fla. - 27.7194
#96. Mississippi - 27.5643
#97. Western Mich. - 26.2056
#98. La. Lafayette - 25.4322
#99. Southern Methodist - 24.6299
#100. Mississippi St. - 23.9756
#101. Houston - 23.5796
#102. Kentucky - 22.8338
#103. Utah St. - 22.0091
#104. Temple - 21.5889
#105. East Caro. - 18.2943
#106. Indiana - 18.2592
#107. Illinois - 18.1962
#108. Ball St. - 16.9403
#109. UNLV - 16.4274
#110. Washington - 15.3223
#111. UCF - 15.3098
#112. Buffalo - 14.6902
#113. San Jose St. - 14.4281
#114. Idaho - 13.9488
#115. Arkansas St. - 12.7652
#116. Duke - 12.7557
#117. Army - 8.8534
#118. Rice - 5.8056
Discussion
The rankings above are exceptionally close to the rankings of last year. A
comparison poll can be seen at
http://sports.espn.go.com/ncf/rankings?pollId=2&seasonYear=2004&weekNumber=17
Looking at the top 25 teams from last year, the first four are identical. Also, of
the top 25 teams in last year's polls, this algorithm placed 20 in its top 25. This
may not seem impressive, but teams in the lower half of these rankings often
drop out from week to week while teams just outside the top 25 move in. It is
difficult to say that I am wrong or that the polls are wrong; they are simply two
different opinions, two different points of view on the same problem.
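The 20-of-25 overlap quoted above is just a set intersection; a minimal sketch (with made-up team lists) looks like this:

```python
def top25_overlap(system_ranking, poll_ranking):
    """Count teams appearing in both top-25 lists. Each argument is a
    list of team names ordered best-first."""
    return len(set(system_ranking[:25]) & set(poll_ranking[:25]))

# Tiny illustrative example with short, hypothetical lists
overlap = top25_overlap(["USC", "Auburn", "Oklahoma", "Texas"],
                        ["USC", "Oklahoma", "Auburn", "Texas"])
```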
On the whole, this approach seems to work. A better approach would be to
compile several years' worth of data and create the function from that; I feel
such a function would give a better representation of rank, since here the
training set is the same size as the testing set. There are also several notes I
would like to make on this ranking. First, this algorithm cannot truly be
objective until the input rank is determined objectively. The input rank was
obtained through a combination of computer models and human input, so in
some ways the algorithm does not truly determine how good a team is on its
own; it is slightly subject to its previous input. However, I do not see how a
computer could start making decisions on how good a team is without first
getting input from a human on what is good or not, whether through training
or a preset function. The human providing the data will be the one who, at
least initially, determines who the “best” team is.
References
http://www.bcsfootball.org/
http://sports.espn.go.com/sports/tvlistings/abcStory?page=aboutbcs
Miscellaneous
Input File Names (relative to the directory where the scripts are;
? is either 5 or 4, depending on year)

200?data/kickret.htm - kick return yardage
200?data/passoff.htm - pass offense yardage
200?data/scoredef.htm - scoring defense
200?data/totaloff.htm - total offensive yardage
200?data/kickretdef.htm - kick return defense yardage
200?data/puntret.htm - punt return yardage
200?data/scoreoff.htm - scoring offense
200?data/turnover.htm - turnover margin
200?data/netpunt.htm - net punt yardage
200?data/rushdef.htm - rush defense yardage
200?data/passdef.htm - pass defense yardage
200?data/rushoff.htm - rush offense yardage
200?data/totaldef.htm - total defensive yardage
200?data/sos.html - strength of schedule
Directories (if using my scripts, you must create these directories before execution)
doneclust
doneclust2004
parseddata
parseddata2004
prefunc
Output File
FINALRANKINGS.TXT
Source Code
Only one copy of each pair of nearly identical scripts is attached; all scripts are
included in softcopy.