CMSC 734 - Information Visualization
Prof. Ben Shneiderman
Application Report

Exploring the Basketball Database with Spotfire

Morimichi Nishigaki (michi@cs.umd.edu)
Sureyya Tarkan (sureyya@cs.umd.edu)
February 27, 2007

Table of Contents
Introduction
Analysis
    Mid-Career Players are the Most Successful
    Best Coaches Suffered from Loss of Their Best Players
    Champion Teams Tend to Lose Their Success in the Next Few Years
Critique of Spotfire
References

Introduction

In this application project, we explore data on basketball players and their statistics in the NBA and ABA, together with team records, coaching records, playoffs, drafts, and regular seasons. The dataset is taken from [3] and contains historical, numerical information from 1946 to 2004 that gives clues about performance. The data is arranged as a set of text files, one each for players, coaches, teams, seasons, and so on. Each file is a two-dimensional table: one row per line, with commas separating the columns. The data for players contains 7,543 items and 23 attributes. The data for coaches contains 1,241 items and 10 attributes. The data for teams contains 1,187 items and 37 attributes.
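As an illustration of this file layout, the following minimal Python sketch reads a comma-separated table and reports its number of items (rows) and attributes (columns). The column names, the sample values, and the helper names `load_table` and `summarize` are hypothetical and chosen only for this example; they are not taken from the dataset in [3].

```python
import csv
import io

# Made-up sample in the layout described above: one row per line,
# commas between columns. Column names and values are illustrative only.
SAMPLE = """player,year,team,minutes,pts
Player A,1990,AAA,3000,1500
Player B,1990,BBB,1200,400
Player A,1991,AAA,2800,1600
"""


def load_table(text_file):
    """Read a comma-separated text file into a list of dicts, one per row."""
    return list(csv.DictReader(text_file))


def summarize(rows):
    """Return (number of items, number of attributes) for a loaded table."""
    n_attributes = len(rows[0]) if rows else 0
    return len(rows), n_attributes


rows = load_table(io.StringIO(SAMPLE))
items, attributes = summarize(rows)
print(f"{items} items, {attributes} attributes")  # prints "3 items, 5 attributes"
```

For a real file, the same helper could be applied to an opened file object instead of the in-memory sample, for example `load_table(open(path, newline=""))` with `path` pointing to one of the text files.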
We intend to find correlations, outliers, and relationships in the dataset that may be useful to spectators and supporters and may inform future suggestions. Such information matters to people who have spent years looking for good metrics that capture relationships in a player's statistics. Managers examine these relationships, and supporters may be interested in them as well. Many websites [1] [2] provide statistical datasets like the one we have chosen, but when specialists define a metric, what they care about is how strongly it correlates with the actual data. The analysis presented here, carried out with Spotfire, can improve this process considerably.

Analysis

Working with a small portion of the dataset, we present below some findings, obtained with line charts and scatter plots, that should interest the users mentioned above.

Mid-Career Players are the Most Successful

In this part, we explore relationships in the careers of the best players. To do so, we need a metric that identifies the best players; we use the Win Score metric, defined in [4] and [6] as follows:

Win Score = PTS + REB + STL + ½*BLK + ½*AST - FGA - ½*FTA - TO - ½*PF

Here PTS is points, REB rebounds, STL steals, BLK blocks, AST assists, FGA field goal attempts, FTA free throw attempts, TO turnovers, and PF personal fouls. Normalized by playing time (per minute or per 48 minutes), this metric represents how much a player contributes to his team. The metric's author reports a .99 correlation between Win Score per minute and a player's performance. We are largely convinced that this is a useful metric simply because we recognize most of the players who appear in the subsequent figure.

In the following figure, the scatter plot shows NBA players who, over their entire playoff careers, have played at least 3000 minutes and have a Win Score per minute of at least .12. With these thresholds we aim to select the players who played the most while maintaining a reasonable Win Score per minute.
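The Win Score formula and the two selection thresholds above can be sketched in Python as follows. This is only an illustration of the arithmetic, not the Spotfire custom expression we used; the class `PlayoffTotals`, the function names, and the three sample players with their totals are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PlayoffTotals:
    """Career playoff totals for one player (field names are assumed)."""
    name: str
    minutes: int
    pts: int
    reb: int
    stl: int
    blk: int
    ast: int
    fga: int
    fta: int
    to: int
    pf: int


def win_score(p):
    """Win Score = PTS + REB + STL + 1/2 BLK + 1/2 AST - FGA - 1/2 FTA - TO - 1/2 PF."""
    return (p.pts + p.reb + p.stl + 0.5 * p.blk + 0.5 * p.ast
            - p.fga - 0.5 * p.fta - p.to - 0.5 * p.pf)


def win_score_per_minute(p):
    """Win Score divided by minutes played."""
    if p.minutes <= 0:
        raise ValueError("minutes must be positive")
    return win_score(p) / p.minutes


def select_players(players, min_minutes=3000, min_rate=0.12):
    """Keep players with at least min_minutes and at least min_rate per minute."""
    return [p for p in players
            if p.minutes >= min_minutes and win_score_per_minute(p) >= min_rate]


# Made-up totals, for illustration only.
players = [
    PlayoffTotals("Player A", 4000, 3000, 1000, 200, 100, 400, 2400, 800, 300, 400),
    PlayoffTotals("Player B", 3500, 1500, 400, 100, 20, 300, 1400, 400, 250, 300),
    PlayoffTotals("Player C", 1000, 800, 300, 50, 40, 100, 600, 200, 80, 100),
]

selected = select_players(players)
print([p.name for p in selected])  # prints ['Player A']
```

In this made-up sample, Player B has enough minutes but too low a rate, and Player C has a high rate but too few minutes, so only Player A passes both thresholds.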
The points of the scatter plot are sized according to the total number of games played by a player, and the color intensity shows the player's Win Score.

[Figure 1: Win Score per minute for each Player in the Playoffs. Scatter plot; horizontal axis: each player; vertical axis: Win Score per minute; point size by number of games (legend range 75 to 237); color by Win Score (min 396.5, max 4144.0). Labeled players: Wilt Chamberlain, Bill Russell, Kareem Abdul-Jabbar, Larry Bird, Magic Johnson, Robert Horry, Michael Jordan, Kevin McHale, Robert Parish, Karl Malone, Scottie Pippen, Byron Scott.]

Although we examined each player's career, space constraints led us to choose two players arbitrarily from Figure 1, Larry Bird and Kevin McHale, and show their playoff careers. As the next two figures show, both players reach their maximum sum of Win Score points near mid-career. This is rather expected, but it is now supported by our analysis.

[Figure 2: Larry Bird's Playoff Career. Line chart; horizontal axis: Year; vertical axis: Sum of Win Score.]

[Figure 3: Kevin McHale's Playoff Career. Line chart; horizontal axis: Year; vertical axis: Sum of Win Score.]

Best Coaches Suffered from Loss of Their Best Players

In this part of our report, we identify the best coaches by winning average and examine their careers. The dataset contains each coach's number of wins and losses from 1946 to 2003. With the expected reader of this report in mind, we filtered it to show only careers from the 1980s onward.

The next scatter plot shows wins and losses in the regular season. Each point represents one coach; the horizontal axis is the sum of season losses, and the vertical axis is the sum of season wins. The size of each point represents the length of the coach's career in years, so the longer the career, the larger the point. A best-fit line is drawn to show the expected winning average; points far above the line indicate a high winning average.

[Figure 4: Wins vs. Losses for each Coach. Scatter plot; horizontal axis: Season Loss [times]; vertical axis: Season Win [times]; point size by years in coaching (legend range 1 to 23). Labeled coaches: Pat Riley, Phil Jackson.]

Two coaches stand out clearly in the figure above: Phil Jackson and Pat Riley. The following analysis focuses on their careers.

Phil Jackson's season wins and losses per year are drawn in the next line chart. His worst season was 1994, followed by a sudden recovery the next year. This dip in his career is related to his team's loss of Michael Jordan, who announced his retirement in 1993 and returned to the team in 1995.

[Figure 5: Career of Phil Jackson. Line chart; horizontal axis: Year; vertical axis: Win and Loss [times]; series: win, loss; teams: Chicago Bulls, LA Lakers; annotations: Michael Jordan retired, Michael Jordan returned.]

The next line chart shows Pat Riley's career, with the number of wins and losses in each year. He had great seasons for most of his career, except after 2001. In that year his team lost two of its best players, Tim Hardaway and Anthony Mason, in trades.

[Figure 6: Career of Pat Riley. Line chart; horizontal axis: Year; vertical axis: Win and Loss [times]; series: win, loss; teams: LA Lakers, NY Knicks, Miami Heat; annotation: Lost two best players, Tim Hardaway & Anthony Mason.]

It is interesting that both of the best coaches experienced their worst seasons because of the loss of their best players. Team managers should take note: even though the data identifies these two as among the best coaches, it took them a few years to rebuild their teams after losing their best players.

Champion Teams Tend to Lose Their Success in the Next Few Years

Apart from the players' and coaches' data, we also looked for remarkable information in the teams' season data. One of the diagrams we constructed shows the total number of wins per year for each team. After inspecting the data, we removed the teams that never won a championship. Through a historical search, we found each year's champion team for 1946-2004 [5].
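The two preparation steps just described, totaling wins per team and year and then dropping teams that never won a championship, can be sketched in Python as follows. We carried out these steps interactively in Spotfire, so this is only an illustration under stated assumptions: the team codes, season values, champion lookup, and function names are all hypothetical.

```python
from collections import defaultdict

# Made-up season records (team, year, wins), for illustration only.
SEASONS = [
    ("AAA", 1990, 60), ("AAA", 1991, 55), ("AAA", 1992, 30),
    ("BBB", 1990, 40), ("BBB", 1991, 42), ("BBB", 1992, 45),
]

# Made-up lookup from year to champion team, standing in for a list like [5].
CHAMPIONS = {1990: "AAA", 1991: "AAA"}


def wins_by_team(seasons):
    """Total wins per team and year, as {team: {year: wins}}."""
    table = defaultdict(dict)
    for team, year, wins in seasons:
        table[team][year] = table[team].get(year, 0) + wins
    return dict(table)


def keep_champions(table, champions):
    """Drop every team that never appears as a champion."""
    winners = set(champions.values())
    return {team: years for team, years in table.items() if team in winners}


def title_years(champions, team):
    """Sorted list of the years in which the given team was champion."""
    return sorted(year for year, champ in champions.items() if champ == team)


table = keep_champions(wins_by_team(SEASONS), CHAMPIONS)
for team, years in table.items():
    print(team, "titles:", title_years(CHAMPIONS, team), "wins:", years)
```

Marking the championship years on each remaining team's wins-per-year series is then what makes any drop in wins after a title visible.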
We then observed that teams that won championships tended to lose this success within the following few years. Below we show three such teams: the Detroit Pistons, the Houston Rockets, and the Philadelphia 76ers.

[Figure 7: Teams Season. Three line charts, one each for Team: DET, Team: HOU, and Team: PHI; horizontal axis: Year; vertical axis: Sum of Won; championship years are marked.]

The first team in the above figure, the Detroit Pistons, won championships in 1988 and 1989. Just four years later, in 1993, it won only 20 of 82 games. The Houston Rockets won championships in 1993 and 1994, but in 1998 their number of wins dropped to 31. Finally, the Philadelphia 76ers won a championship in 1966 but won only 9 games in 1972. They won another championship in 1982, after which their wins dropped to 36 within 5 years and to 18 in 1995. Something appears to happen after a championship win that specialists should explain; perhaps it is due to the loss of players after a championship, or to a loss of motivation in the team.

Critique of Spotfire

Spotfire facilitates information visualization in many ways. First, it reveals useful relationships in the data that are hard to observe at first glance. Second, it quickly generates charts from a dataset and links several of them, allowing users to explore details interactively. Furthermore, Spotfire accepted our dataset's format, so we did not need to modify it. Among its other advantages, the demonstration video is well prepared and easy to access from Spotfire at the very beginning, which helps beginners grasp what they can do with the tool.
Moreover, the user interface lets users switch an axis easily from one column of the dataset to another, which speeds up the search for meaningful representations of the data. Thanks to flexible filtering functions, users can also easily narrow the data down to a particular portion of interest. One powerful option in Spotfire is the set of aggregate functions available when creating histograms. Custom expressions are supported for aggregate functions, but using them is rather complicated and error-prone.

Among its disadvantages, Spotfire is not designed for general use; it is intended mainly for specific uses, so many novice users may have trouble understanding it. For instance, we had difficulty figuring out what the presented charts meant. In addition, some features are inappropriately named, which confuses users further; we had to go through the help topics to identify these misleading features.

Another disadvantage is that Spotfire focuses on analyzing data, not on producing reports. One problem is the inflexible legend layout: Spotfire does not let users place the legend on a graph, only show or hide it. It is also inconvenient for report figures that users cannot set axis names arbitrarily. And although labels are important in report figures, users cannot change their layout.

Another restriction concerns selection. When some data is selected on a figure, details are shown in the details-on-demand area. One would expect a selection within the details-on-demand area to change the view in turn and show more detail, for example by focusing on a specific row. We wanted this function while using Spotfire, but it is not implemented.
Yet another limitation concerns column types. Spotfire recognizes the type of a column's values, such as string, integer, and real. Yet even when the type is integer, Spotfire by default shows a graph's axis labels with decimal fractions; they should be shown as integers to avoid confusing users.

Finally, Spotfire lets users create multiple pages in a file; however, duplicated pages are linked together. There is a function to copy a page, but although users may duplicate a page to view the same figure under different settings, changes such as filtering on one page are reflected on the other because of this linkage. There should be an easy way to cut this link, or duplicated pages should not share settings at all.

Although the tool is mostly robust, we found some bugs while using it. One flaw concerns deleted rows: after the data is loaded, Spotfire offers an option to delete the rows that the user wants to omit. However, when the chart is saved and loaded again, Spotfire reads in the original data rather than the saved chart, so the previously omitted rows reappear. The natural assumption for such a tool is that the data is loaded once, the visualization is created, and the visualization itself is saved, not the data.

Since we were using the tool for the first time, we may have missed some options that would overcome the problems stated above. On the whole, Spotfire simplified our task, and without it, it would have been hard to gain such insight into this data.

References
[1] http://82games.com/
[2] http://www.basketball-reference.com/
[3] http://www.databasebasketball.com/
[4] http://dberri.wordpress.com/
[5] http://www.nba.com/history/finals/champions.html
[6] http://www.wagesofwins.com/