LEVYIlluminx

To sports fans and teams, the scoreline of a match rarely tells an adequate story – it
requires many supplementary, individual statistics. In most free-flowing sports like
basketball or soccer, teams have never had sufficient data or powerful enough analysis
tools to find truly significant statistics. With the rise of big data analytics in the computer
science industry, however, there is a growing trend among clubs and fans towards the
establishment of sports analytics communities – seeking more and better statistics from
the wealth of data that new technologies are producing. In the future of sports, it seems
almost inevitable that these analytics will yield game-changing conclusions.
Numbers Don’t Lie: The Rise of Big Data in Sports Analytics
by Tal Levy
The study of objective numbers and statistics in baseball – sabermetrics – irrevocably
changed professional baseball when it rose to popularity in the late 20th century. Today,
the wealth of statistics available on professional baseball games, leagues, and
players and the recent book and film Moneyball are testaments to
the popularity of objective statistics among American sports fans. But despite
occasional attempts, no other sport has succeeded at finding nearly as many valuable,
objective statistics on a consistent basis, largely due to the complexity of spatial and
time data in most other sports. However, so-called “big data” analysis and
increasingly sophisticated computer science techniques have driven a recent rise in
computational sports analytics in spatio-temporal sports like basketball and soccer. With
rapidly advancing computing technologies, both clubs and fans are encouraging the
growth of sports analytics to draw potentially game-changing statistical conclusions.
Figure 1: Scorecard from the first ever perfect game in
baseball, 1880. Source: Wikimedia Commons
Figure 2: Scorecard from a 2000 regular
season game. Source: Wikimedia commons
The Growth of Sports Stats
The recording of basic stats has nearly always been a part of most sports. Henry
Chadwick, the “father of baseball,” invented the box score for recording each event of a
baseball game, in a manner that is very similar to scorekeeping today. In baseball,
nearly all player decisions and game events are discrete, with easily defined and
recorded outcomes, making statistical analysis significantly easier than in more
free-flowing sports. It is therefore unsurprising that the first studies applying statistical
modeling and analysis to baseball team management were published as early as 1959
[1]. Through the next several decades, the use of sabermetrics, named after the
Society for American Baseball Research (SABR), founded in the 1970s, gained popularity
alongside a notable, steady increase in scholarly sports analytics articles [2]. In baseball,
this trend has been especially apparent over the past 15 years, with the publication and
subsequent film adaptation of Moneyball, a book about the use of data and modeling to
improve the performance of the Oakland Athletics [1].
In other sports such as football, individual stats categorizing touchdowns and yard gains
can be found as early as 1921 [3], but until very recently, the complex dimensionality of
individual actions in these sports has made it impossible for comprehensive statistics
to inform future play. It was only in the 2000s that journals devoted to quantitative
analysis of sports in general arose, with the founding of the Journal of Quantitative
Analysis in Sports and the Journal of Sports Economics [1]. Still, as late as 2005, only a
handful of National Basketball Association (NBA) teams used any advanced statistics in
normal operations, with many sports team administrators actively resisting new insights
brought in by data rather than experience [1]. In soccer, resistance to the use of
statistics is even stronger, with arguments that the free-flowing “beautiful game” simply
can’t be valuably quantified by statistics [4]. Progress in soccer analytics is further
set back by highly independent European clubs that keep data inside their own analytics
departments for fear of giving up competitive advantages [5].
Getting the Data
The first step in drawing useful conclusions about sports is acquiring accurate and
relevant data. Again, though many significant baseball statistics can be calculated and
recorded by hand from watching individual plays, the positioning and relative movement
of 10 to 22 players in basketball, soccer, or football are significantly more difficult to
capture.
Figure 3: STATS infographic of SportVU at a soccer match.
Source: STATS LLC.
With advances in HD video capture combined with powerful video-analysis tools,
companies such as STATS LLC or ProZone are able to capture and digitize the
location of each player and the ball in football, soccer, and basketball matches.
Though the exact methods vary between sports and companies, the basic
principles are the same: by setting up several HD video cameras and running the
video through computer vision algorithms, the operator is able to record every player’s
location as X, Y, and Z coordinates at each frame of video [6]. With this data, sports
analysts can recreate entire matches quantitatively and begin more complex
analysis.
A recreation of a basketball match based solely on numerical position
and event data, generated using STATS SportVU cameras.
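To make the idea of per-frame coordinates concrete, here is a minimal sketch of how one tracked position might be represented and used. The field names, the identifiers, and the choice of feet as the unit are assumptions made for illustration; they are not the format of any vendor's actual feed.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Position:
    """One tracked object's location in a single video frame (unit assumed: feet)."""
    frame: int       # frame index within the game
    object_id: str   # hypothetical identifier, e.g. a player ID or "ball"
    x: float
    y: float
    z: float = 0.0   # height above the playing surface


def distance_covered(track):
    """Sum the straight-line distances between consecutive frames, ignoring height."""
    ordered = sorted(track, key=lambda p: p.frame)
    return sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(ordered, ordered[1:]))


# Three made-up frames for one hypothetical player.
track = [
    Position(0, "player_7", 0.0, 0.0),
    Position(1, "player_7", 3.0, 4.0),
    Position(2, "player_7", 3.0, 10.0),
]
total = distance_covered(track)
print(total)
```

Even a statistic as simple as distance covered is out of reach of hand-kept scorecards, yet it falls out of the coordinate data in a few lines.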
Data Processing Infrastructure
Once positional data has been collected, however, the challenge of processing the data
is far from solved. In a game like basketball, comprehensive data collection tools
capture as many as 80,000 frames per game, yielding over 800,000 data points in
player positions alone. When an analyst wants to find trends across hundreds of games or
generate their own statistics, the number of data points quickly reaches the hundreds of
millions – enough that complex analyses can take several hours, even on a very
powerful computer. Fortunately, sports analysts seeking to parse these data sets benefit
from a multitude of techniques developed in the last two decades with the explosion of
the internet and other computing technologies.
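The arithmetic behind these figures is easy to check. The sketch below assumes ten players on the court in every frame and counts one data point per player position; both are simplifying assumptions rather than details taken from any tracking provider.

```python
FRAMES_PER_GAME = 80_000   # upper figure quoted in the text
PLAYERS_ON_COURT = 10      # assumption: five players per side in basketball
COORDS_PER_POSITION = 2    # assumption: X and Y only, ignoring height

positions_per_game = FRAMES_PER_GAME * PLAYERS_ON_COURT
values_per_game = positions_per_game * COORDS_PER_POSITION


def games_needed(target_points, points_per_game=positions_per_game):
    """Smallest number of games whose player positions reach target_points."""
    return -(-target_points // points_per_game)  # ceiling division on integers


print(positions_per_game)          # player positions in one game
print(values_per_game)             # individual coordinate values in one game
print(games_needed(100_000_000))   # games needed to pass one hundred million positions
```

Under these assumptions a single game yields 800,000 player positions, so a data set only reaches the hundreds of millions once it spans well over a hundred games.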
Internet giant Google built its business on selling targeted advertisements based on
individual users and their searches, and quickly ran into the “big data problem” of
maintaining its index of internet links while collecting data about each user’s
searches to improve its algorithms. Google dealt with this problem by developing the
“MapReduce” programming model and publishing a research paper on it. MapReduce combines
two existing programming techniques, map and reduce, into an elegant pattern for
processing a huge amount of data very quickly by splitting up work between many
computers [7]. By carefully implementing this pattern, a user can analyze the millions of
data points generated in a basketball match far more quickly than on a single computer.
Figure 4: Abstract Diagram of a MapReduce operation. Several workers split up
the data and write (“map”) their results to an intermediate area, where more
workers compile (“reduce”) the results to a single output.
Source: Google Research Publication – MapReduce
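The pattern in the diagram can be sketched in a few lines. The example below runs every step in a single process to show the shape of the computation; a real MapReduce framework would hand each chunk to a different machine. The records and player identifiers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (player ID, distance moved during one frame, in feet).
records = [
    ("player_7", 1.5), ("player_23", 0.5), ("player_7", 2.0),
    ("player_23", 1.0), ("player_7", 0.5), ("player_11", 3.0),
]


def split(data, n_chunks):
    """Divide the input into roughly equal chunks, one per simulated worker."""
    return [data[i::n_chunks] for i in range(n_chunks)]


def map_phase(chunk):
    """Map: emit a (key, value) pair for each record in the chunk."""
    return [(player, dist) for player, dist in chunk]


def shuffle(mapped_chunks):
    """Group the intermediate values by key before they reach the reducers."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine every value collected for one key into a single result."""
    return key, sum(values)


# Each chunk is mapped independently, which is what lets the work be spread out.
mapped = [map_phase(chunk) for chunk in split(records, 3)]
totals = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(totals)
```

Because no map call depends on any other chunk, adding more workers shortens the map step without changing the result.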
Accompanying the rise of so-called “distributed processing” technologies is the
emergence of cheap computational power. As powerful computers get cheaper,
companies such as Amazon have begun to offer web services, often marketed as
“cloud computing”, to consumers and businesses. For rates as low as seven
cents per hour per computer [8], data analysts can rent remote access to Amazon’s
“data centers” – large facilities with tens if not hundreds of thousands of powerful
computers. With such powerful infrastructure, a sports analyst can run an analysis of
optical data from hundreds of games, with hundreds of millions of data points, and receive
an answer in just a few seconds.
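To put the quoted price in perspective, the sketch below estimates what a short job on a rented cluster might cost. The seven-cent rate is the figure cited above; the cluster size, the job length, and the rule that every started hour is billed in full are assumptions for illustration, not a description of Amazon's actual pricing terms.

```python
import math

RATE_CENTS_PER_MACHINE_HOUR = 7  # the "seven cents" figure quoted in the text


def rental_cost_cents(machines, hours, rate=RATE_CENTS_PER_MACHINE_HOUR):
    """Cost in cents, assuming each machine is billed for every started hour."""
    return machines * math.ceil(hours) * rate


# Hypothetical job: 100 machines busy for half an hour.
cost = rental_cost_cents(machines=100, hours=0.5)
print(f"${cost / 100:.2f}")
```

Under these assumptions, an analysis that borrows a hundred machines costs about as much as a stadium snack.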
The State of Analysis
In the past few years, the field of statistical sports analytics seems to have grown almost
exponentially, with a study finding over 170 scholarly sports analytics articles published
in 2009, compared with fewer than 100 in 2005 and fewer than 50 in 2000 [2]. The MIT
Sloan Sports Analytics Conference (SSAC), which hosts figures from sports leagues
worldwide as well as research groups and sports statistics enthusiasts, attracted over
2,000 attendees in 2011, and more than half of the 30 NBA teams have dedicated
analytics departments within the organization [1]. Despite some mistrust of data
analytics in some football organizations, many teams employ data analysts on an
ongoing basis, allowing them influence in key strategic decisions [1].
Outside of the U.S. sports scene, many soccer clubs are investing in sports analytics.
Though the world of soccer lags behind many other sports – only one of over 100
papers at the 2012 SSAC was based on soccer – the high value of the worldwide
soccer market between television deals, tournament awards, merchandising, and player
transfers has led many clubs and leagues to invest in statistical analytics, with dozens
of clubs using data from Opta, a leading soccer data company, for their internal scouting
and strategy decisions [9]. Though these internal analytics departments are far from
replacing in-person scouting for most sports teams, they have already influenced teams’
strategies.
An Analytics Community
Outside of internal analytics, there’s a growing sports analytics community – fans who
are not directly associated with sports teams, but have an active interest in the
conclusions they can draw from statistics. Baseball’s sabermetrics originated with an
interested community of unaffiliated fans, and most sports leagues hope to encourage
similar communities among bloggers, research groups, and other fans [9]. At USC’s
Information Sciences Institute, researchers in the Computational Behavioral Group are
using STATS data from NBA games to draw conclusions about basketball shots and
rebounds, work that earned them the “Best Paper” award at the SSAC in 2012. One of
their conclusions suggests that many teams fail to get close enough to the basket to
win offensive rebounds. Today, they are working on drawing more conclusions
regarding different types of plays, positioning, and defense [5].
https://www.youtube.com/watch?v=o_LXI708Yls
Despite the huge worldwide soccer fanbase, the use of data analytics in the public
realm has been relatively limited, with only a few analysis sites publishing limited, if
insightful, articles [9]. This past season, however, Manchester City, one of the
wealthiest soccer clubs in the English Premier League, announced a plan to give any
would-be soccer analyst access to nearly all of City’s database, with the intention of
growing the soccer analytics community as a whole [9]. Major League Soccer, the
first-tier soccer league in the United States and Canada, has all of its league matches
analyzed by Opta. The league regularly releases articles highlighting interesting details
from Opta data, allows public access to a limited view of the data, and plans to fully
release the data to fans in the near future [10]. In doing so, these groups hope to
encourage the general soccer community to build an analytics community, potentially
finding results to help improve the sport, just as the baseball community did in the 1970s,
and as communities in basketball and football are doing now.
Conclusion
Professional sports teams have always been concerned with improving their play, and
every sport has maintained some statistics in an attempt to measure that progress. As
the value of success in sports rises, teams and fans increasingly look for more
comprehensive analysis, to better measure player quality, good strategy, or season
performance. Until very recently, the tools to draw such conclusions simply didn’t exist
in major sports outside baseball. In the past decade, however, radical advancements in
computing, data storage, and computer vision technologies have made an
unprecedented amount of data available to clubs and fans. Increasingly, the only barrier
to interesting, detailed, statistical analysis is an analyst’s creativity, and these changes
are reflected in burgeoning sports analytics communities. With the wealth of data and
the enticement of tackling these never-solved problems as motivation, it seems only a
matter of time before these communities draw conclusions to revolutionize their
sports.
References
[1] B. Alamar and V. Mehrotra. “Beyond Moneyball”. Analytics Magazine. Internet:
http://www.analytics-magazine.org/special-articles/391-beyond-moneyball-the-rapidlyevolving-world-of-sports-analytics-part-i
[2] J. Coleman. (March/April 2012). “Identifying the ‘Players’ in Sports Analytics
Research.” Interfaces. Internet. Vol. 42 no. 2. Available:
http://interfaces.journal.informs.org/content/42/2/109.abstract
[3] “1921 Chicago Staleys Statistics”. Pro-Football-Reference. Internet: http://www.profootball-reference.com/teams/chi/1921.htm
[4] S. Fenn, R. Xu, A. Oshansky and D. Pleuler. “The N.A.S.A. Roundtable”. The Shin
Guardian. Internet: http://theshinguardian.com/2013/01/02/the-n-a-s-a-nerds-attacksoccer-analytics-roundtable/, Jan. 02, 2013.
[5] A. Smith. “Can USC researchers change the NBA through science?” USC News.
June 13, 2012. Internet: http://news.usc.edu/#!/article/35821/moneyball-for-basketballusing-science-to-change-the-nba/
[6] “SportVU | Player Tracking Technology.” Internet: http://www.sportvu.com/index.asp
[7] J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large
Clusters.” Google, Inc. Internet: http://research.google.com/archive/mapreduce.html
[8] “Amazon Elastic MapReduce”. Internet: http://aws.amazon.com/elasticmapreduce/
[9] Z. Slaton. “Game Changer: MCFC Analytics Releases Full Season of Opta Data for
Public Use.” Forbes. August 16, 2012. Internet:
http://www.forbes.com/sites/zachslaton/2012/08/16/game-changer-mcfc-analyticsreleases-full-season-of-opta-data-for-public-use/3/
[10] C. Schlosser. Reddit:
http://www.reddit.com/r/MLS/comments/101si9/i_am_chris_schlosser_general_manager_mls_digital/