Soccer Match Predictor
This short program uses a set of input parameters to predict the outcome of a soccer
match. The essential idea is to give players ratings, then introduce an element of randomness into
their performance, and finally pick a winner. The main program declares eleven players for each
team, with the positions of the declared player objects corresponding to the formation played by
the team. Then the program assigns each player a randomized rating, and predicts the outcome
using an overall team rating which simply corresponds to the sum of all the random player ratings.
Each position class is derived from a player base class. This is because all players of
different positions have certain characteristics in common. They all have an intelligence, a “spirit”
(determination) and an athleticism rating, all of which are equally important to any player. Each
player has a base rating corresponding to the average of these three basic characteristics. They
also all have a normal distribution associated with their overall rating, which is a way of
incorporating the fact that no player performs perfectly consistently. The distribution type,
std::normal_distribution, comes from the <random> header introduced in C++11.
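As a minimal sketch of the idea in this paragraph, the snippet below computes a base rating as the plain average of the three shared characteristics and draws one performance value from a normal distribution. The struct and function names (PlayerBase, sample_performance) and the use of std::mt19937 as the generator are assumptions made for illustration; they are not taken from the original program.

```cpp
#include <cassert>
#include <random>

// Hypothetical stand-in for the player base class; member names are assumed.
struct PlayerBase {
    double intelligence;
    double spirit;       // determination
    double athleticism;

    // Base rating: unweighted average of the three shared characteristics.
    double base_rating() const {
        return (intelligence + spirit + athleticism) / 3.0;
    }
};

// Draw one performance value from a normal distribution centred on `mean`.
// The generator type (std::mt19937) is an assumption of this sketch.
double sample_performance(double mean, double stddev, std::mt19937& gen) {
    assert(stddev > 0.0);  // std::normal_distribution requires a positive stddev
    std::normal_distribution<double> dist(mean, stddev);
    return dist(gen);
}
```

A player rated 6, 7 and 8 on the three characteristics would have a base rating of 7, and repeated calls to sample_performance would scatter around whatever mean is passed in.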
All the players also have an overall positional rating, defined as a weighted average of all
their position-specific characteristic ratings. This positional rating is weighted depending on the
style the team plays. This is what the style data member corresponds to in the player base class. If
the player has style one (1), he plays for an attacking team, and if the player has style two (2), he
plays for a team that employs a defensive system. If the player is in an attacking system, his
attacking attributes are weighted more heavily in the calculation of his positional rating, and vice
versa if the player plays in a defensive system. This feature was included in order to incorporate
the factor of overall team chemistry. There are many teams with essentially equally talented
players, but the most successful teams buy the right players for a particular system. There have
been many instances where on paper, just based on the quality of individual players, a team
should have been contending for championships, but ended up delivering a thoroughly mediocre
season. The model captures this phenomenon.
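The style weighting described above could look roughly like the following. The style codes 1 and 2 come from the text; the function name, the reduction of a player's attributes to one attacking and one defensive value, and the 0.7 / 0.3 weights are illustrative assumptions, since the write-up does not state the actual weights.

```cpp
#include <cassert>

// Style codes follow the text: 1 = attacking system, 2 = defensive system.
enum Style { kAttacking = 1, kDefensive = 2 };

// Weighted average of a player's attacking and defensive attribute ratings.
// The 0.7 / 0.3 split is an assumed value chosen only for illustration.
double positional_rating(double attack, double defence, int style) {
    assert(style == kAttacking || style == kDefensive);
    const double attack_weight = (style == kAttacking) ? 0.7 : 0.3;
    return attack_weight * attack + (1.0 - attack_weight) * defence;
}
```

With these assumed weights, a player with strong attacking attributes scores higher in an attacking system than in a defensive one, which is the chemistry effect the model is meant to capture.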
Attacking players also tend to perform more inconsistently, as attacking soccer requires
more risk and is often more difficult to execute. However, attacking players also have more
freedom, and so do extraordinary things more often. Accordingly, the standard deviation
associated with the normal distribution of an attacking player is greater than that of a defensive
player, who is more consistent, but less likely to perform significantly above or below his average.
This is taken into account by the “create_distribution” method, which assigns the standard
deviation and average to be used in constructing the normal distributions of each player.
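A rough sketch of what such a method might do is shown below as a free function. The name create_distribution is from the text, but this signature is an assumption, as is the attacking standard deviation of 2.5; the defensive value of 1.5 is the one used for Chelsea in the trial simulation described later.

```cpp
#include <cassert>
#include <random>

// Build a player's performance distribution from his mean rating and the
// team style (1 = attacking, 2 = defensive). Attacking players get the
// wider spread. The value 2.5 is an assumed, illustrative number.
std::normal_distribution<double> create_distribution(double mean_rating,
                                                     int style) {
    assert(style == 1 || style == 2);
    const double stddev = (style == 1) ? 2.5 : 1.5;
    return std::normal_distribution<double>(mean_rating, stddev);
}
```

Both styles keep the same mean, so only the consistency of the player changes with the system, not his expected performance.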
The friend template function “set_rand_rating” draws a value at random from the normal
distribution data member of each player and assigns it to the rand_rating data member of that
player. It is this rand_rating data member that is most significant in determining
the winner and loser of the match. First, each player on each team is assigned a rand_rating. Then each
team is assigned an overall team rating, which is simply the sum of the rand_ratings of
all the players on the team. The team with the higher team rating wins the match.
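The three steps in this paragraph can be sketched as follows. The names set_rand_rating and rand_rating come from the text; the Player struct, team_rating, play_match, the std::mt19937 generator and the handling of an exact tie are assumptions of this sketch, and the template is written as a free function rather than a friend for brevity.

```cpp
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical, simplified player: a distribution plus the drawn rating.
struct Player {
    std::normal_distribution<double> dist;
    double rand_rating = 0.0;

    Player(double mean, double stddev) : dist(mean, stddev) {
        assert(stddev > 0.0);
    }
};

// Step 1: draw one value from the player's own distribution.
template <typename P>
void set_rand_rating(P& player, std::mt19937& gen) {
    player.rand_rating = player.dist(gen);
}

// Step 2: the team rating is the sum of the players' rand_ratings.
double team_rating(const std::vector<Player>& team) {
    return std::accumulate(
        team.begin(), team.end(), 0.0,
        [](double sum, const Player& p) { return sum + p.rand_rating; });
}

// Step 3: the higher team rating wins. Returns 1 or 2 for the winning
// team, or 0 for an exact tie (tie handling is an assumption).
int play_match(std::vector<Player>& a, std::vector<Player>& b,
               std::mt19937& gen) {
    for (auto& p : a) set_rand_rating(p, gen);
    for (auto& p : b) set_rand_rating(p, gen);
    const double ra = team_rating(a);
    const double rb = team_rating(b);
    if (ra > rb) return 1;
    if (rb > ra) return 2;
    return 0;
}
```

Because the ratings are continuous random values, an exact tie is practically impossible, which is consistent with the model always producing a winner.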
In order to test the model, a trial simulation was made using Arsenal F.C.'s and Chelsea F.C.'s
2014-2015 rosters, with starting elevens that I chose and rated myself. Chelsea was the
Premier League Champion this season and the stronger of the two sides, although not by much.
Arsenal finished in 3rd. Therefore, the average player ratings of the Chelsea players are a bit
higher, implying that if the model did not include randomness, Chelsea would simply win every
game. Of course this is not realistic, so the Gaussian player-rating probability distributions were
introduced. However, these needed to be tuned, for if the two teams' standard deviations were
chosen too close to one another, the team with the higher average rating would win just about
every time. In reality, even much stronger teams lose to much weaker teams relatively often, a fact
that needed to be considered. For the trial simulation, Chelsea and Arsenal are fairly evenly
matched, with Chelsea having a slight advantage. The model was tuned in order to produce
statistical distributions that realistically correspond to hypothetical matches between these two teams.
The histograms below show results for four simulations with different values for the standard
deviation of the attacking team, which in this case is Arsenal. The Chelsea standard deviation was
set to 1.5 rating points, a reasonable number considering player ratings are out of 10
possible points. The Arsenal standard deviation was adjusted from this base value. Each histogram
corresponds to one simulation, each of which consists of 1000 program runs, with each run
itself comprising 100 games between the two teams.
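The experiment described here can be approximated with a short driver like the one below. It is a simplified sketch: each team is reduced to eleven mean ratings and a single shared standard deviation, and the Team struct, all function names and the std::mt19937 generator are assumptions rather than parts of the original program.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Simplified team: eleven mean ratings and one shared standard deviation.
struct Team {
    std::vector<double> mean_ratings;
    double stddev;
};

// Sum of one random draw per player.
double sample_team_rating(const Team& t, std::mt19937& gen) {
    assert(t.stddev > 0.0);
    double total = 0.0;
    for (double m : t.mean_ratings) {
        std::normal_distribution<double> dist(m, t.stddev);
        total += dist(gen);
    }
    return total;
}

// One program run: (wins of a) - (wins of b) over `games` matches.
int win_difference(const Team& a, const Team& b, int games,
                   std::mt19937& gen) {
    int diff = 0;
    for (int g = 0; g < games; ++g) {
        const double ra = sample_team_rating(a, gen);
        const double rb = sample_team_rating(b, gen);
        if (ra > rb) ++diff;
        else if (rb > ra) --diff;
    }
    return diff;
}

// One simulation: `runs` program runs, one win difference per run.
// These values are what a histogram of the results would be built from.
std::vector<int> simulate(const Team& a, const Team& b, int runs, int games,
                          std::mt19937& gen) {
    std::vector<int> diffs;
    for (int r = 0; r < runs; ++r) {
        diffs.push_back(win_difference(a, b, games, gen));
    }
    return diffs;
}
```

Calling simulate(chelsea, arsenal, 1000, 100, gen) with Chelsea's standard deviation at 1.5 and a chosen Arsenal value would yield the 1000 win differences for one histogram; the mean ratings passed in are whatever the user assigns to the starting elevens.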
[Figure: histograms (a) to (d), one per simulation, of (Chelsea Wins) – (Arsenal Wins) per 100 games]
The histograms show the distribution of the difference between the
number of Chelsea wins and the number of Arsenal wins, i.e. (Chelsea Wins) – (Arsenal Wins). A
positive value means Chelsea won more games out of 100, and vice versa. The first plot has a mean
value of 41, meaning that in this trial Chelsea won on average 41 more games out of 100 than
Arsenal. This is unreasonably high considering the high and comparable quality of both teams this
season. The standard deviation for the Arsenal team was increased to improve this result,
permitting Arsenal to over-perform more significantly in the model. The Arsenal standard
deviation corresponding to plot (c) is, on the other hand, probably too high: it is very unlikely that
Arsenal would ever win more games than Chelsea out of 100. The values in (b) are
probably closest to a realistic outcome. Out of 100 games, Chelsea wins on average about 30 more
times than Arsenal, and Arsenal essentially never wins more games out of 100.
The plots below correspond to the four simulations plotted in the histograms above, and
help to demonstrate the amount of randomness in the results. They consider only the first 100 of
the 1000 runs so that the fluctuations can be seen more clearly. The plots on the left show the
fluctuation of the results over time, where time is an implicit variable running from left to right,
so the leftmost points of each plot were calculated earliest. The 100 points are connected by a
smooth curve to emphasize the oscillation; note again that each point corresponds to 100 games
between the two teams. The plots on the right are simply scatter plots of the same first 100
runs.
In the future, it would be interesting to run the model for two teams where the attacking
side is the slightly stronger of the two teams and see if the standard deviations chosen for the
model also produce accurate results for that scenario. This seems like the most likely outcome, but
it would be important to verify this. Otherwise, the model would have to be redesigned, as it is of
course intended to accurately predict the outcomes of football matches between teams of all types.
It would also be interesting to build a model with subtler team styles. For example, instead of just
attacking or defensive, one could add finer-grained styles, such as an attacking possession-based
style or a defensive counter-attacking style. Since football managers often change such specific team tactics
from game to game, this would allow one to use the model to predict outcomes of individual
games more accurately, by incorporating the tactics each manager most probably will use for the
specific opponent. The model may be a little input-intensive, but could possibly be turned into an
interesting phone application or something along those lines with further development and
testing.