Using Clustering to Develop a College Football Ranking System
Fall 2005 ECE 539 Final Project
Joseph Detmer

Abstract

This project undertook the task of creating a college football ranking system based purely on statistics. The data was obtained from two impartial websites. The algorithm (explained in greater detail below) first clusters the data and then feeds the clustering into an equation from which the final rank results. A reasonable system was created: the rankings produced from our test data were very close to those of the polls.

Introduction

Being a former soccer player, I had long been wary of embracing the game of football. The game went back and forth, lasted a long time, and there was a lot of dead time between moments of actual "action." High school football is not all that entertaining, and I was never really able to embrace a professional or collegiate team, feeling too distant from any particular one. Then I came to college. Games were much more exciting: the fans were more numerous and enthusiastic, the athletes were bigger, and the skill of the players was much improved. Not many other things can make thousands of people wake up early on a cool Saturday morning in the fall to sit in a parking lot grilling brats for several hours. While professional football is extremely exciting, college football is not far behind, and is held in higher regard by some. Whether it is an unranked team upsetting a highly ranked opponent, a national title game, or an alma mater playing its hated arch-rival, college football entertains millions.

At the end of each season, any college football team with at least 6 wins is eligible for a bowl game. Since 1998, a computer system has been in place to determine the top college football teams in the country. The initial purpose of the system was to match "equal" teams up to play in exciting contests beyond just the national championship game.
This system brought the top four bowl games together, letting each take a turn hosting the national title game. These bowls had initially been reserved for the champions of the best conferences; the new system allowed a good team from a somewhat weaker conference to play in a big game at the end of the season.

Any system created to predict how good a team is, or who will win a game, will eventually fail. There are too many factors that cannot be taken into account, such as star players being injured, an amazing strategy developed by a coaching staff, or a team simply having an off day. This system therefore determines which team is, on average, the best.

Motivation

The most difficult question is: how do you quantify how "good" a team is? Is it their record? Is it how many points they score, the total yards of offense they gain, or the turnovers they create? If one were to look up statistics for college football teams, one could find any kind of stat one would ever want. For example, one could find rankings of every team's average yardage on 1st down in the 3rd quarter. However, such information is far too specific for the question of how "good" a team is. Most experts would agree that a combination of many factors can be combined into a quantitative representation of how good a team is. The BCS system uses several computer models combined with polls to determine its rankings; I will create a system using only a computer model.

In this project, it was decided that how good a team is would be determined from a small subset of general data. These data points do not directly correspond to inputs to the system. Each data set is first clustered into several clusters. The clustered data from one year is then used to create a function. This function is applied to a second year's (2004) data, which results in a test set.
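Later in the report, the ranking function is obtained by solving A * X = Y with a pseudo-inverse in Matlab. As a minimal illustration of that fit-and-apply step, here is a Python sketch; the toy numbers, array sizes, and function names are mine, not from the project scripts:

```python
import numpy as np

def fit_rank_function(X_train, y_train):
    # Solve A * X = Y for the coefficient vector A.  X is not square,
    # so use the Moore-Penrose pseudo-inverse instead of a true inverse.
    return y_train @ np.linalg.pinv(X_train)

def apply_rank_function(A, X_test):
    # Apply the learned coefficients to another year's cluster weights.
    return A @ X_test

# Toy data (made up): 3 statistics (rows) for 5 teams (columns),
# each entry a cluster weight, plus a known rank score per team.
X_2005 = np.array([[9.0, 7.0, 5.0, 3.0, 1.0],
                   [8.0, 8.0, 4.0, 2.0, 2.0],
                   [10.0, 6.0, 5.0, 4.0, 1.0]])
y_2005 = np.array([100.0, 80.0, 55.0, 30.0, 10.0])

A = fit_rank_function(X_2005, y_2005)
scores = apply_rank_function(A, X_2005)  # higher score = better team
```

In the project itself the second year's cluster weights would be passed to apply_rank_function; here the training matrix is reused only as a sanity check.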
Data Collection

The first and most difficult part of the project was gathering the necessary data. The first step was to decide which statistics are most influential in how good a team is. As noted above, there is a huge number of statistics that could be retrieved. Which ones should be chosen? We want a large enough set of data to get good results, but not so large that the statistics become redundant and the data difficult to obtain. The problem was initially broken into four pieces:

Offense
Defense
Special Teams & Turnovers
Record & Strength of Schedule

These four pieces can be handled more easily than the whole together.

Offense

A good offense is integral to how good a team is. If a team is never able to score points, it will never win a game. It is not unusual for a defense to score points, but to count on it for a team's entire point production would be suicide. So how do we decide how good an offense is? Rushing yards have proven to be a very important part of a football team: if a team cannot run the ball consistently, it is difficult to tire a defense. Passing yardage is also extremely important; passing can produce quick points, or keep the ball in your possession late in the game. However, a balanced attack is truly the key to a good offense. While teams with either a solid running game or a solid passing game can sometimes be effective, a team with both is much better, because it keeps the defense guessing what will come next. Finally, we must remember that yards mean nothing if a team does not score, so offensive scoring is an integral part as well. This leaves us with 4 data sets for offense:

Rushing yardage
Passing yardage
Total offensive yardage
Offensive scoring

Defense

The saying goes, "Offense wins games, defense wins championships." But how do we determine how good a defense is?
Most would say: the exact opposite of determining how good an offense is. Therefore, for the same reasons as in the offense section, our data sets for defense are:

Rushing yardage allowed
Passing yardage allowed
Total yardage allowed
Total defensive points allowed

Special Teams & Turnovers

While offense and defense are a huge part of football, games often come down to special teams. If neither team can produce any yards on offense, the team with better field position is more likely to win. I was hoping to find a statistic for average starting field position, but after a tedious search turned up nothing of the sort, I opted for statistics that could approximate it. Finally, we cannot forget turnovers. Oftentimes a single turnover changes the course of a game. There are many different types (fumbles and interceptions, both offensive and defensive), but do we really need to take each of them into account? If a team creates many turnovers but gives up just as many, it does not really gain anything; therefore turnover margin is included in our data set. This brings the data collected in this section to the following sets:

Turnover margin
Net punting yardage
Punt return yardage
Kick return yardage
Kick return defensive yardage allowed

Record & Strength of Schedule

The single most important factor in determining how good a football team is, is its record. The whole purpose of a football game is to see who will win. If a team loses, it is very difficult to argue that that team is better. Statistics for how many yards a team gains, or how many turnovers it creates, are an important part of determining how good a team is, but the most important part should be the team's record. A good team will rarely get beat; a poor team will rarely win. There is one more important statistic that should be taken into account, though.
If a team has few losses but has played no "good" teams, while another team with a couple of losses has played only "good" teams, who is better if the two have not played each other? Strength of schedule answers this. It is calculated as two parts the record of your opponents and one part the record of your opponents' opponents, yielding a decimal between 0 and 1. Our last two data sets are:

Record
Strength of schedule

As stated above, there are many websites with statistics on college football. Some have very little data; some have quite a few statistics, but few useful ones. In the end I chose two websites that contained all the data I was looking for. Since they are different pages, the two sites each required a unique way of parsing the .htm files for the necessary data. One reason I chose these two sites was that they keep archived data, so I was able to obtain data from last year as well as this year. All data except strength of schedule was gathered from the first site. The data was gathered from the following two sites:

http://web1.ncaa.org/d1mfb/natlRank.jsp?div=4&site=org
http://www.warrennolan.com/football/2005/sos

One final site was used to classify the training data. I was only able to find one source that ranked all 119 teams; most polls and rankings cover only the top 25. This site did not archive data, so I was only able to obtain this ranking for the current year:

http://cbs.sportsline.com/collegefootball/polls/119/index1

I developed two perl scripts to parse the data. They are essentially the same, but slightly modified for the two different cases. Data for 2005 was used as the training set, and data for 2004 was used as the testing set. It should be noted that these scripts were developed after downloading the files on Sun Solaris machines. Problems occurred when downloading with Internet Explorer on Windows; it seems the webpages are saved in different formats.
Therefore, if you wish to recreate this project, proceed with caution. It should also be noted that all directories must be created before any commands are executed; the commands do not check whether directories or files exist before using them. The user only needs to ensure that all directories are created and that the initial .htm files exist and are named correctly. All Matlab execution was done on a Windows platform. The two perl files are named extract_stuff.pl and 2004extract_stuff.pl. These two files output files that contain the team name as well as the specific data set.

Data Manipulation

The data sets created by the two initial perl scripts are the inputs to our clustering method. The clustering method takes each individual data set and clusters it, using ten clusters. The number of clusters was chosen to be large enough that the separation between large and small statistics is noticeable, yet small enough that clusters are of reasonable size. Each data set was clustered, and each data point in that set was assigned to a particular cluster. Each cluster was then given a weight (an integer from 1 to 10) according to the rank of that cluster, with 1 being the lowest and 10 the highest. Once this occurred, the new weights were output to a file, where they would be rearranged. The two files that handle this clustering are cl.m and cl2004.m.

[Figure: an example clustering of the defensive pass yardage data set. The points shown in blue belong to the 2nd cluster; the remaining cluster centers are shown with their data points removed, and Voronoi lines mark where one cluster ends and the next begins.]

The reason clustering was chosen as the engine of this operation is its versatility. From year to year, statistics can vary greatly.
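The per-statistic clustering-and-weighting step described above can be sketched in a few lines. The project did this in Matlab (cl.m and cl2004.m); the plain 1-D k-means below and the function names are my own illustrative substitutes, not the project's code:

```python
import random

def kmeans_1d(values, k=10, iters=100, seed=0):
    # Plain 1-D k-means: returns a cluster label for each value
    # and the final cluster centers.
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Move each center to the mean of its assigned values.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers

def cluster_weights(values, k=10):
    # Cluster one statistic, then weight each team's cluster with an
    # integer 1..k: 1 for the lowest-valued cluster, k for the highest.
    labels, centers = kmeans_1d(values, k)
    order = sorted(range(k), key=lambda j: centers[j])
    weight_of_cluster = {j: rank + 1 for rank, j in enumerate(order)}
    return [weight_of_cluster[lab] for lab in labels]
```

With ten clusters per statistic, every team's raw number is replaced by an integer weight from 1 to 10 before the function-fitting step.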
In a particularly cold and rainy season, little offensive production can lead to meager statistics. A trained MLP or fuzzy system would have a much more difficult time dealing with statistics that fluctuate from year to year than a clustering algorithm does. Rankings should be created from how good a team is in a particular year, not from comparisons to other years.

Once the clustering is complete and the output sent to files, two more perl scripts reform the data into a single tidy array. This array contains each team's cluster results on one line, with the corresponding data set in each column. The 2005 data is also given a rank, which is essentially all the training it receives; this is discussed later. The two files that reform the data are toFunc.pl and 2004toFunc.pl.

Finally, a single number was calculated for each test set. To calculate it, a function first needed to be created from the training set (2005 data), using a matrix algebra trick. Call the matrix of inputs X, the known rank for each team Y, and the coefficient matrix A. We know that

    A * X = Y,  so  A = Y * X^(-1)

However, since X is not square, we cannot take its true inverse; we take the pseudo-inverse instead:

    A = Y * pinv(X)

Once we have the coefficient matrix, we apply the coefficients to our test matrix, which gives approximate rankings. The matlab file that does this is makefunction.m. The data is then sorted with a final perl script (getnewranks.pl) and output to its final file (FINALRANKINGS.TXT).

Results

#1. Southern California - 167.3486
#2. Auburn - 161.4341
#3. Oklahoma - 116.2092
#4. Texas - 112.4908
#5. Miami (Fla.) - 112.4448
#6. Virginia Tech - 111.4653
#7. California - 108.7134
#8. Florida St. - 108.4097
#9. Utah - 107.8165
#10. Louisville - 107.5966
#11. Iowa - 107.5766
#12. Boise St. - 104.7642
#13. Georgia - 100.8888
#14. Bowling Green - 95.4476
#15. Purdue - 91.1031
#16. Virginia - 88.6308
#17. Arizona St. - 88.2669
#18. Texas A&M - 86.3354
#19. Wisconsin - 83.4357
#20. Navy - 83.3217
#21. Fla. Atlantic - 83.0395
#22. Ohio St. - 80.6131
#23. Tennessee - 79.4607
#24. UTEP - 79.2717
#25. Texas Tech - 79.0969
#26. Troy - 77.8624
#27. Pittsburgh - 77.6789
#28. Michigan - 76.7551
#29. Memphis - 75.9943
#30. Georgia Tech - 75.2869
#31. LSU - 74.5189
#32. Miami (Ohio) - 74.3935
#33. Fresno St. - 73.4196
#34. Oregon St. - 73.057
#35. Connecticut - 72.5143
#36. Northern Ill. - 72.2317
#37. Penn St. - 71.3406
#38. Florida - 69.7872
#39. West Virginia - 69.6947
#40. Oklahoma St. - 68.7826
#41. Notre Dame - 68.6967
#42. Minnesota - 67.5152
#43. Hawaii - 66.6198
#44. Arkansas - 64.826
#45. Wyoming - 63.4262
#46. New Mexico - 62.9263
#47. Colorado - 62.323
#48. Toledo - 62.2786
#49. Boston College - 62.1604
#50. North Carolina - 61.1465
#51. Cincinnati - 59.039
#52. South Carolina - 58.1333
#53. Marshall - 56.4976
#54. Arizona - 55.924
#55. Clemson - 55.6617
#56. Brigham Young - 55.4913
#57. TCU - 55.0452
#58. Kansas - 54.9249
#59. UAB - 54.3204
#60. Maryland - 54.1745
#61. Iowa St. - 52.9625
#62. Nebraska - 51.3158
#63. Washington St. - 50.6705
#64. Stanford - 50.27
#65. Oregon - 49.8981
#66. Alabama - 49.7194
#67. Akron - 49.5765
#68. North Carolina St. - 48.8846
#69. Tulane - 48.5183
#70. UCLA - 48.1761
#71. Missouri - 47.5022
#72. Kent St. - 47.1419
#73. Southern Miss. - 46.0345
#74. Northwestern - 45.1281
#75. Middle Tenn. St. - 44.2224
#76. North Texas - 44.1114
#77. Michigan St. - 43.7986
#78. San Diego St. - 41.1538
#79. Syracuse - 40.7939
#80. Louisiana Tech - 40.6867
#81. Nevada - 38.9364
#82. Wake Forest - 38.7285
#83. New Mexico St. - 38.2424
#84. La. Monroe - 36.0539
#85. Rutgers - 34.4375
#86. Kansas St. - 32.1734
#87. Colorado St. - 31.5946
#88. Vanderbilt - 31.0494
#89. Air Force - 30.8899
#90. Central Mich. - 30.5194
#91. Eastern Mich. - 30.2132
#92. Baylor - 29.4173
#93. Tulsa - 28.4209
#94. Ohio - 28.2544
#95. South Fla. - 27.7194
#96. Mississippi - 27.5643
#97. Western Mich. - 26.2056
#98. La. Lafayette - 25.4322
#99. Southern Methodist - 24.6299
#100. Mississippi St. - 23.9756
#101. Houston - 23.5796
#102. Kentucky - 22.8338
#103. Utah St. - 22.0091
#104. Temple - 21.5889
#105. East Caro. - 18.2943
#106. Indiana - 18.2592
#107. Illinois - 18.1962
#108. Ball St. - 16.9403
#109. UNLV - 16.4274
#110. Washington - 15.3223
#111. UCF - 15.3098
#112. Buffalo - 14.6902
#113. San Jose St. - 14.4281
#114. Idaho - 13.9488
#115. Arkansas St. - 12.7652
#116. Duke - 12.7557
#117. Army - 8.8534
#118. Rice - 5.8056

Discussion

The rankings above are exceptionally close to last year's poll rankings. A comparison poll can be seen at http://sports.espn.go.com/ncf/rankings?pollId=2&seasonYear=2004&weekNumber=17 . Looking at the top 25 teams from last year, the first four are identical. Also, this algorithm placed 20 of last year's poll top-25 teams in its own top 25. This may not seem good, but teams in the lower half of those rankings often drop out from week to week, while teams just outside the top 25 come in. It is difficult to say that I am wrong or that the polls are wrong; they are simply two different opinions, two different points of view on the same problem.

On the whole, this approach seems to work. A better approach would be to compile several years' worth of data and create the function from that; I feel such a function would give a better representation of rank, since here the training set is the same size as the testing set. There are also several notes I would like to make on this ranking. First, the algorithm cannot be truly objective until the input rank is determined objectively. That rank was obtained through a combination of computer models and human input, so in some sense this algorithm does not truly determine how good a team is; it remains slightly subject to its previous input.
However, I do not see how a computer could start making decisions on how good a team is without first getting input from a human on what is good or not, whether through training or a preset function. The human programming the data will be the one who, at least initially, determines who the "best" team is.

References

http://www.bcsfootball.org/
http://sports.espn.go.com/sports/tvlistings/abcStory?page=aboutbcs

Miscellaneous

Input file names (relative to the directory where the scripts are; ? is either 5 or 4, depending on the year):

200?data/kickret.htm      kick return yardage
200?data/passoff.htm      pass offense yardage
200?data/scoredef.htm     scoring defense
200?data/totaloff.htm     total offensive yardage
200?data/kickretdef.htm   kick return defense yardage
200?data/puntret.htm      punt return yardage
200?data/scoreoff.htm     scoring offense
200?data/turnover.htm     turnover margin
200?data/netpunt.htm      net punt yardage
200?data/rushdef.htm      rush defense yardage
200?data/passdef.htm      pass defense yardage
200?data/rushoff.htm      rush offense yardage
200?data/totaldef.htm     total defensive yardage
200?data/sos.html         strength of schedule

Directories (if using my scripts, you must create these directories before execution):

doneclust
doneclust2004
parseddata
parseddata2004
prefunc

Output file: FINALRANKINGS.TXT

Source Code

Only one copy of each of the nearly identical scripts is attached; all are included in softcopy.
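For anyone recreating the pipeline, the directory setup and run order implied by the notes above can be sketched as follows. The directory and script names come from the report; the run order and invocation details are my assumptions, so the script calls appear only as comments:

```python
import os

# Create the working directories the scripts expect; as noted above,
# the scripts do not check that directories exist before using them.
for d in ["doneclust", "doneclust2004", "parseddata",
          "parseddata2004", "prefunc"]:
    os.makedirs(d, exist_ok=True)

# Hypothetical run order per the report (left as comments because the
# scripts, the downloaded .htm files, and the exact Matlab invocation
# are not part of this sketch):
#   perl extract_stuff.pl ; perl 2004extract_stuff.pl   (parse statistics)
#   cl.m ; cl2004.m           (cluster each data set, in Matlab)
#   perl toFunc.pl ; perl 2004toFunc.pl                 (reshape weights)
#   makefunction.m            (fit coefficients, score the test year)
#   perl getnewranks.pl       (sort into FINALRANKINGS.TXT)
```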