To sports fans and teams, the scoreline of a match rarely tells the whole story; it takes many supplementary, individual statistics to do that. In most free-flowing sports like basketball or soccer, teams have never had sufficient data or powerful enough analysis tools to find truly significant statistics. With the rise of big data analytics in the computing industry, however, clubs and fans are increasingly establishing sports analytics communities that seek more and better statistics from the wealth of data that new technologies are producing. It seems almost inevitable that these analytics will eventually produce game-changing conclusions.

Numbers Don’t Lie: The Rise of Big Data in Sports Analytics
by Tal Levy

Sabermetrics, the study of objective numbers and statistics in baseball, irrevocably changed professional baseball when it rose to popularity in the late 20th century. Today, the wealth of statistics available on professional baseball matches, leagues, and players and the recent book and film Moneyball are testaments to the popularity of objective statistics among American sports fans. But despite occasional attempts, no other sport has succeeded at finding nearly as many valuable, objective statistics on a consistent basis, largely because of the complexity of spatial and temporal data in most other sports. However, the rise of so-called “big data” analysis and increasingly sophisticated computer science techniques has driven a recent surge in computational sports analytics in spatio-temporal sports like basketball and soccer. With rapidly advancing computing technologies, both clubs and fans are encouraging the growth of sports analytics to draw potentially game-changing statistical conclusions.

Figure 1: Scorecard from the first ever perfect game in baseball, 1880. Source: Wikimedia Commons

Figure 2: Scorecard from a 2000 regular season game.
Source: Wikimedia Commons

The Growth of Sports Stats

The recording of basic stats has nearly always been a part of most sports. Henry Chadwick, the “father of baseball,” invented the box score for recording each event of a baseball game, in a manner very similar to scorekeeping today. In baseball, nearly all player decisions and game events are discrete events with easily defined and recorded outcomes, making statistical analysis significantly easier than in more free-flowing sports. It is therefore unsurprising that the first studies applying statistical modeling and analysis to baseball team management were published as early as 1959 [1]. Over the next several decades, sabermetrics, named after the Society for American Baseball Research (SABR), founded in 1971, gained in popularity, with a notable, steady increase in scholarly sports analytics articles [2]. In baseball, this trend may have been especially apparent over the past 15 years, with the publication and subsequent film adaptation of Moneyball, a book about the use of data and modeling to improve the performance of the Oakland Athletics [1].

In other sports such as football, individual stats categorizing touchdowns and yard gains can be found as early as 1921 [3], but until very recently, the complex dimensionality of individual actions in these sports has made it impossible for comprehensive statistics to inform future play. It was only in the 2000s that journals devoted to the quantitative analysis of sports in general arose, with the founding of the Journal of Quantitative Analysis in Sports and the Journal of Sports Economics [1]. Still, as late as 2005, only a handful of National Basketball Association (NBA) teams used any advanced statistics in normal operations, with many sports team administrators actively resisting new insights brought in by data rather than experience [1].
In soccer, resistance to the use of statistics is even stronger, with arguments that the free-flowing “beautiful game” simply can’t be meaningfully quantified by statistics [4]. Progress in soccer analytics is further set back by highly independent European clubs that try to keep data inside their own analytics departments, for fear of giving up competitive advantages [5].

Getting the Data

The first step in producing valuable sports analysis is acquiring accurate and relevant data. Although many significant baseball statistics can be calculated and recorded by hand from watching individual plays, the positioning and relative movement of 10 to 22 players in basketball, soccer, or football are significantly more difficult to capture.

Figure 3: STATS infographic of SportVU at a soccer match. Source: STATS LLC.

With advances in HD video capture combined with powerful video-analysis tools, companies such as STATS LLC or ProZone are able to capture and digitize the locations of each player and the ball in football, soccer, and basketball matches. Though the exact methods vary between sports and companies, the basic principles are the same: by setting up several HD video cameras and running the video through computer vision algorithms, the operator can record every player’s location as X, Y, and Z coordinates at each frame of video [6]. With this information, sports analysts can recreate entire matches quantitatively and begin more complex analysis.

A recreation of a basketball match based solely on numerical position and event data, generated using STATS SportVU cameras.

Data Processing Infrastructure

Once positional data has been collected, however, the challenge of processing it is far from solved. In a game like basketball, comprehensive data collection tools capture as many as 80,000 frames per game, yielding over 800,000 data points in player positions alone.
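As a rough illustration of the figures above, the following Python sketch shows one way a single tracked position might be represented and how the per-game count arises. The `PositionSample` record, its field names, and the helper `samples_per_game` are hypothetical and are not taken from any vendor’s data format. The only number from the article is 80,000 frames per game; ten players on court is simply the standard basketball lineup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionSample:
    """One tracked object's location in one video frame (hypothetical layout)."""
    frame: int        # index of the video frame
    object_id: str    # a player identifier, or "ball"
    x: float          # court coordinates; units are an assumption
    y: float
    z: float


FRAMES_PER_GAME = 80_000   # upper figure quoted in the article
PLAYERS_ON_COURT = 10      # five per side in basketball


def samples_per_game(frames: int, tracked_objects: int) -> int:
    """Number of position samples when every object is recorded in every frame."""
    return frames * tracked_objects


# Player positions alone, then players plus the ball as an eleventh object.
player_samples = samples_per_game(FRAMES_PER_GAME, PLAYERS_ON_COURT)
player_and_ball_samples = samples_per_game(FRAMES_PER_GAME, PLAYERS_ON_COURT + 1)

# A single example record, with made-up coordinates.
example = PositionSample(frame=0, object_id="player_23", x=12.5, y=30.0, z=0.0)
```

Under these assumptions, ten players over 80,000 frames gives 800,000 position samples, and tracking the ball as well brings the total to 880,000.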
When an analyst wants to find trends across several games or generate their own statistics, the number of data points quickly reaches the hundreds of millions, enough that complex analyses can take several hours even on a very powerful computer.

Fortunately, sports analysts seeking to parse these data sets benefit from a multitude of techniques developed in the last two decades with the explosion of the internet and other computing technologies. Internet giant Google built its business on selling targeted advertisements based on individual users and their searches, and quickly ran into the “big data problem” of maintaining its index of internet links while also collecting data about each user’s searches to improve its algorithms. Google dealt with this problem by developing “MapReduce” and publishing a research paper on it. MapReduce combines two existing programming techniques, map and reduce, into an elegant pattern for processing a huge amount of data very quickly by splitting the work between several computers [7]. By carefully implementing this pattern, a user can analyze the millions of data points generated in a basketball match far more quickly than on a single computer.

Figure 4: Abstract diagram of a MapReduce operation. Several workers split up the data and write (“map”) their results to an intermediate area, where more workers compile (“reduce”) the results to a single output. Source: Google Research Publication, MapReduce

Accompanying the rise of so-called “distributed processing” technologies is the emergence of cheap computational power. As powerful computers get cheaper, companies such as Amazon have begun to offer web services, often marketed as “cloud computing,” to consumers and businesses. For rates as low as seven cents per hour per computer [8], data analysts can rent remote access to Amazon’s “data centers,” large facilities with tens if not hundreds of thousands of powerful computers.
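The map-and-reduce pattern described above can be sketched in a few lines of Python. This is a simplified, single-process illustration and not Google’s implementation: a real MapReduce system distributes the chunks across many machines and groups intermediate results by key before reducing. The record layout `(player_id, x, y)`, the basket location, the 10-unit radius, and all function names are assumptions made up for the example.

```python
import math
from collections import Counter
from functools import reduce

BASKET = (0.0, 0.0)   # hypothetical basket location in court coordinates
NEAR_RADIUS = 10.0    # hypothetical cutoff for "near the basket"

# Hypothetical tracking records: (player_id, x, y)
RECORDS = [
    ("a", 3.0, 4.0),
    ("a", 30.0, 4.0),
    ("b", 5.0, 5.0),
    ("b", 0.0, 9.0),
    ("a", 1.0, 1.0),
    ("c", 20.0, 20.0),
]


def split(records, n_chunks):
    """Divide the records into roughly equal chunks, one per worker."""
    size = max(1, math.ceil(len(records) / n_chunks))
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Map step: count, per player, the samples in this chunk near the basket."""
    counts = Counter()
    for player_id, x, y in chunk:
        if math.hypot(x - BASKET[0], y - BASKET[1]) <= NEAR_RADIUS:
            counts[player_id] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial per-player counts into one."""
    return left + right


def near_basket_counts(records, n_chunks=2):
    """Run the map step on each chunk, then fold the partial results together."""
    partials = [map_chunk(chunk) for chunk in split(records, n_chunks)]
    return reduce(reduce_counts, partials, Counter())


result = near_basket_counts(RECORDS, n_chunks=3)
```

Because each chunk is processed independently and the merge step does not depend on the order of the partial results, the list comprehension in `near_basket_counts` is the part that could be handed to separate workers or machines without changing the answer.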
With such powerful infrastructure, a sports analyst can run an analysis on optical data from hundreds of games, with hundreds of millions of data points, and receive an answer in just a few seconds.

The State of Analysis

In the past few years, the field of statistical sports analytics has grown rapidly, with a study finding over 170 scholarly sports analytics articles published in 2009, compared with fewer than 100 in 2005 and fewer than 50 in 2000 [2]. The MIT Sloan Sports Analytics Conference (SSAC), which hosts figures from sports leagues worldwide as well as research groups and sports statistics enthusiasts, attracted over 2000 attendees in 2011, and more than half of the 30 NBA teams have dedicated analytics departments within the organization [1]. Despite lingering mistrust of data analytics in some football organizations, many teams employ data analysts on an ongoing basis and give them influence in key strategic decisions [1].

Outside of the U.S. sports scene, many soccer clubs are investing in sports analytics. The world of soccer lags behind many other sports: only one of over 100 papers at the 2012 SSAC was based on soccer. Even so, the high value of the worldwide soccer market, spanning television deals, tournament awards, merchandising, and player transfers, has led many clubs and leagues to invest in statistical analytics, with dozens of clubs using data from Opta, a leading soccer data company, for their internal scouting and strategy decisions [9]. Though these internal analytics departments are far from replacing in-person scouting for most sports teams, they have already influenced teams’ strategies.

An Analytics Community

Outside of internal analytics, there is a growing sports analytics community: fans who are not directly associated with sports teams but have an active interest in the conclusions they can draw from statistics.
Baseball’s sabermetrics originated with an interested community of unaffiliated fans, and most sports leagues hope to encourage similar communities among bloggers, research groups, and other fans [9]. At USC’s Information Sciences Institute, researchers in the Computational Behavioral Group are using STATS data from NBA games to draw conclusions about basketball shots and rebounds, work that earned them the “Best Paper” award at the SSAC in 2012. One of their conclusions suggests that many teams are failing to get close enough to the basket to win offensive rebounds. Today, they are working on drawing more conclusions regarding different types of plays, positioning, and defense [5].

https://www.youtube.com/watch?v=o_LXI708Yls

Despite the huge worldwide soccer fanbase, the use of data analytics in the public realm has been relatively limited, with only a few analysis sites publishing limited, if insightful, articles [9]. This past season, however, Manchester City, one of the wealthiest soccer clubs in the English Premier League, announced a plan to give any would-be soccer analyst access to nearly all of City’s database, with the intention of growing the soccer analytics community as a whole [9]. Major League Soccer, the first-tier soccer league in the United States and Canada, has all of its league matches analyzed by Opta. The league regularly releases articles highlighting interesting details from Opta data, allows public access to a limited view of the data, and plans to fully release the data to fans in the near future [10]. In doing so, these groups hope to encourage the general soccer community to build an analytics community, potentially finding results that help improve the sport, just as the baseball community did in the 1970s and as communities in basketball and football are doing now.

Conclusion

Professional sports teams have always been concerned with improving their play, and every sport has maintained some statistics in an attempt to measure that progress.
As the value of success in sports rises, teams and fans increasingly look for more comprehensive analysis to better measure player quality, good strategy, or season performance. Until very recently, the tools to draw such conclusions simply didn’t exist in major sports outside baseball. In the past decade, however, radical advancements in computing, data storage, and computer vision technologies have made an unprecedented amount of data available to clubs and fans. Increasingly, the only barrier to interesting, detailed statistical analysis is an analyst’s creativity, and these changes are reflected in burgeoning sports analytics communities. With the wealth of data and the enticement of tackling these never-solved problems as motivation, it seems only a matter of time before these communities draw conclusions that revolutionize their sports.

References

[1] B. Alamar and V. Mehrotra. “Beyond Moneyball.” Analytics Magazine. Internet: http://www.analytics-magazine.org/special-articles/391-beyond-moneyball-the-rapidlyevolving-world-of-sports-analytics-part-i
[2] J. Coleman. “Identifying the ‘Players’ in Sports Analytics Research.” Interfaces. Vol. 42, no. 2, March/April 2012. Internet: http://interfaces.journal.informs.org/content/42/2/109.abstract
[3] “1921 Chicago Staleys Statistics.” Pro-Football-Reference. Internet: http://www.profootball-reference.com/teams/chi/1921.htm
[4] S. Fenn, R. Xu, A. Oshansky and D. Pleuler. “The N.A.S.A. Roundtable.” The Shin Guardian. Jan. 02, 2013. Internet: http://theshinguardian.com/2013/01/02/the-n-a-s-a-nerds-attacksoccer-analytics-roundtable/
[5] A. Smith. “Can USC researchers change the NBA through science?” USC News. June 13, 2012. Internet: http://news.usc.edu/#!/article/35821/moneyball-for-basketballusing-science-to-change-the-nba/
[6] “SportVU | Player Tracking Technology.” Internet: http://www.sportvu.com/index.asp
[7] J. Dean and S. Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters.” Google, Inc. Internet: http://research.google.com/archive/mapreduce.html
[8] “Amazon Elastic MapReduce.” Internet: http://aws.amazon.com/elasticmapreduce/
[9] Z. Slaton. “Game Changer: MCFC Analytics Releases Full Season of Opta Data for Public Use.” Forbes. August 16, 2012. Internet: http://www.forbes.com/sites/zachslaton/2012/08/16/game-changer-mcfc-analyticsreleases-full-season-of-opta-data-for-public-use/3/
[10] C. Schlosser. Reddit. Internet: http://www.reddit.com/r/MLS/comments/101si9/i_am_chris_schlosser_general_manager_mls_digital/