Exploring the Basketball Database with Spotfire

CMSC 734 – Information Visualization
Prof. Ben Shneiderman
Application Report
Morimichi Nishigaki (michi@cs.umd.edu)
Sureyya Tarkan (sureyya@cs.umd.edu)
February 27, 2007
Table of Contents
Introduction
Analysis
    Mid-Career Players are the Most Successful
    Best Coaches Suffered from Loss of Their Best Players
    Champion Teams Tend to Lose Their Success in the Next Few Years
Critique of Spotfire
References
Introduction
With this application project, we explore data on basketball players and their statistics
in the NBA and ABA, along with team records, coaching records, playoffs, drafts, and
regular seasons. The dataset is taken from [3] and contains historical numerical
information, covering 1946 to 2004, that gives clues about performance.
The data is arranged as a set of text files, e.g. one for players, another for coaches,
and others for teams, seasons, etc. Each file is a two-dimensional table in which commas
separate the columns and new lines separate the rows. The data for players contains 7,543
items and 23 attributes. The data for coaches contains 1,241 items and 10 attributes. The
data for teams contains 1,187 items and 37 attributes.
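As a minimal sketch of this layout, the following Python snippet parses comma-separated
text into rows and columns with the standard csv module. The column names and the two
sample rows are made up for illustration; they are not taken from the dataset in [3].

```python
import csv
import io

# Hypothetical excerpt in the same shape as the dataset files:
# the first line holds the column names, each later line is one item.
SAMPLE = """player_id,year,team,minutes,pts
PLAYER01,1990,AAA,3000,1500
PLAYER02,1991,BBB,2500,900
"""


def load_table(text):
    """Parse comma-separated text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


rows = load_table(SAMPLE)
print(len(rows), "items,", len(rows[0]), "attributes")
```

All values are read as strings, so numeric columns would still need to be converted
before any analysis.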
We intend to find correlations, outliers, and relationships in the dataset that may be useful
to spectators and supporters and as a basis for future suggestions. Such information matters
to people who have spent years looking for good metrics that capture relationships in a
player's statistics. Team managers examine these relationships, and supporters may also be
interested in them.
Many websites [1] [2] provide statistical datasets like the one we have chosen, but what
matters to specialists is how well a metric they define correlates with the actual data.
The analysis presented here with the tool Spotfire can make this check considerably easier.
Analysis
Working with a small portion of the dataset, we present below some findings, obtained with
line charts and scatter plots, that should be interesting to the users mentioned above.
Mid-Career Players are the Most Successful
In this part, we explore the careers of the best players. To do so, we need a metric that
identifies the best players; we use the Win Score metric, defined in [4] and [6] as
follows:
Win Score = PTS + REB + STL + ½*BLK + ½*AST - FGA - ½*FTA - TO - ½*PF
This metric represents how much a player contributes to his team, and it can be normalized
by playing time (per minute or per 48 minutes). The metric's author states that there is a
.99 correlation between Win Score per minute and a player's performance. We are largely
convinced that this is a useful metric simply because we recognize most of the players
shown in the subsequent figure.
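The formula can be written as a small Python function. This is a sketch only: the function
and parameter names are ours, and the sample totals are made up rather than taken from the
dataset.

```python
def win_score(pts, reb, stl, blk, ast, fga, fta, to, pf):
    """Win Score as given by the formula above."""
    return (pts + reb + stl + 0.5 * blk + 0.5 * ast
            - fga - 0.5 * fta - to - 0.5 * pf)


def win_score_per_minute(score, minutes):
    """Normalize a Win Score total by minutes played."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return score / minutes


# Made-up career totals for a hypothetical player.
score = win_score(pts=2000, reb=800, stl=150, blk=60, ast=500,
                  fga=1600, fta=500, to=250, pf=200)
print(score)                                        # 1030.0
print(round(win_score_per_minute(score, 3000), 4))  # 0.3433
```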
In the following figure, the scatter plot shows NBA players who, over their entire playoff
careers, have played at least 3000 minutes and have a Win Score per minute of at least .12.
With these thresholds we select the players who played the most while keeping a reasonable
Win Score per minute. The points are sized according to the total number of games played
by a player, and the color intensity shows the player's Win Score value.
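The selection described above can be sketched in Python as follows. In our analysis the
filtering was done with Spotfire's query devices; the code below is only an illustration,
and the record layout and the three sample players are hypothetical.

```python
MIN_MINUTES = 3000         # minimum playoff minutes, as stated in the text
MIN_WS_PER_MINUTE = 0.12   # minimum Win Score per minute, as stated in the text


def select_players(players):
    """Keep players who meet both thresholds and attach their per-minute value."""
    selected = []
    for p in players:
        if p["minutes"] < MIN_MINUTES:
            continue
        per_minute = p["win_score"] / p["minutes"]
        if per_minute >= MIN_WS_PER_MINUTE:
            selected.append({**p, "ws_per_minute": per_minute})
    return selected


# Hypothetical playoff career totals.
players = [
    {"name": "Player A", "minutes": 4000, "games": 120, "win_score": 600.0},
    {"name": "Player B", "minutes": 2500, "games": 90, "win_score": 500.0},
    {"name": "Player C", "minutes": 5000, "games": 150, "win_score": 400.0},
]

kept = select_players(players)
print([p["name"] for p in kept])  # ['Player A']
```

Player B is dropped for too few minutes, and Player C for a Win Score per minute of .08.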
Scatter plot. Horizontal axis: each player; vertical axis: Win Score per minute. Point size
by number of games (legend values 75 and 237); color by Win Score (min 396.5, max 4144.0).
Labeled players: Wilt Chamberlain, Bill Russell, Kareem Abdul-Jabbar, Larry Bird, Magic
Johnson, Robert Horry, Michael Jordan, Kevin McHale, Robert Parish, Karl Malone, Scottie
Pippen, Byron Scott.
Figure 1: Win Score per minute for each Player in the Playoffs
Although we looked at the career of each player above, because of space constraints we
arbitrarily chose two players from Figure 1, Larry Bird and Kevin McHale, to show their
playoff careers. As the next figures show, both players reach their maximum sum of Win
Score close to the middle of their careers. This is rather expected, and our analysis
supports it for these two players.
Line chart. Horizontal axis: Year; vertical axis: Sum of Win Score.
Figure 2: Larry Bird's Playoff Career
Line chart. Horizontal axis: Year; vertical axis: Sum of Win Score.
Figure 3: Kevin McHale's Playoff Career
Best Coaches Suffered from Loss of Their Best Players
In this part of our report, we identify the best coaches in terms of winning average and
look at their careers. The dataset contains each coach's number of wins and number of
losses from 1946 to 2003. With the expected readers of this report in mind, we filtered it
to show only the period from the 1980s onward.
The next scatter plot shows the wins and losses in the regular season. In the figure, each
point represents one coach; the horizontal axis is the sum of season losses, and the
vertical axis is the sum of season wins. The size of each point represents the length of
the coach's career in years, so the longer the career, the larger the point. The
best-fitting line is drawn to show the expected winning average. Points far above the line
indicate a high winning average.
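To illustrate how such a line separates coaches, the sketch below fits a line to (losses,
wins) pairs and ranks coaches by their vertical distance above it. We assume an ordinary
least-squares fit here; we did not verify how Spotfire computes its fitting line, and the
four coaches and their totals are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical career totals: coach -> (season losses, season wins).
coaches = {
    "Coach A": (100, 100),
    "Coach B": (200, 200),
    "Coach C": (300, 300),
    "Coach D": (200, 400),
}

losses = [v[0] for v in coaches.values()]
wins = [v[1] for v in coaches.values()]
slope, intercept = fit_line(losses, wins)

# Positive residual: more wins than the line predicts for that many losses.
residuals = {
    name: w - (slope * l + intercept) for name, (l, w) in coaches.items()
}
best = max(residuals, key=residuals.get)
print(best, residuals[best])  # Coach D 150.0
```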
Scatter plot. Horizontal axis: Season Loss [times]; vertical axis: Season Win [times].
Point size by years in coaching (legend values 1 and 23). Labeled coaches: Pat Riley, Phil
Jackson.
Figure 4: Wins vs. Losses for each Coach
Two coaches stand out clearly in the figure above: Phil Jackson and Pat Riley. The
following analysis focuses on the careers of these two coaches.
The next line chart shows Phil Jackson's number of season wins and losses per year. He had
his worst season in 1994 and recovered abruptly the following year. This dip in his career
is related to his team's loss of Michael Jordan, who announced his retirement in 1993 and
returned to the team in 1995.
Line chart. Horizontal axis: Year; vertical axis: Win and Loss [times]. Series: win, loss.
Teams: Chicago Bulls, LA Lakers. Annotations: Michael Jordan retired; Michael Jordan
returned.
Figure 5: Career of Phil Jackson
The next line chart shows Pat Riley's career as the number of wins and the number of losses
in each year. He had great seasons for most of his career, except after 2001. In that year
two of his team's best players, Tim Hardaway and Anthony Mason, left the team.
Line chart. Horizontal axis: Year; vertical axis: Win and Loss [times]. Series: win, loss.
Teams: LA Lakers, NY Knicks, Miami Heat. Annotation: Lost two best players, Tim Hardaway &
Anthony Mason.
Figure 6: Career of Pat Riley
It is interesting that both of these coaches had their worst seasons after losing their
best players. Team managers should note the following: even though these are two of the
best coaches identified from the data, it was hard for them to rebuild their teams for a
few years after the loss of the best players.
Champion Teams Tend to Lose Their Success in the Next Few Years
Apart from the players' and coaches' data, we also looked for remarkable information in the
teams' season data. One of the diagrams we constructed shows the total number of wins per
year for each team.
We first removed the teams that never won a championship. Through some historical search,
we found the champion team of each year from 1946 to 2004 [5]. We then observed that teams
that win championships tend to lose this success within the next few years. Below, we show
three such teams: the Detroit Pistons, the Houston Rockets, and the Philadelphia 76ers.
Three line charts, one per team (DET, HOU, PHI). Horizontal axis: Year; vertical axis: Sum
of Won. Championship years are marked on each chart.
Figure 7: Teams Season
The first team in the above figure, the Detroit Pistons, won championships in 1988 and
1989. Just four years after these successful years, in 1993, it won only 20 of its 82
games.
Secondly, the Houston Rockets won championships in 1993 and 1994, but in 1998 their number
of wins dropped to 31.
Finally, the Philadelphia 76ers won a championship in 1966 but won only 9 games in 1972.
They won another championship in 1982, and their wins dropped to 36 within 5 years and to
18 in 1995.
These examples suggest that something happens after a championship that specialists should
explain. Perhaps it is due to the loss of players after a championship, or to a loss of
motivation in the team.
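The gaps quoted above can be checked with a short calculation. The sketch below uses only
the years and win counts mentioned in the text; the 1987 season for Philadelphia is our
reading of "within 5 years" of the 1982 championship, and the function name is ours.

```python
# Championship years as given in the text, keyed by team abbreviation.
CHAMPIONSHIPS = {
    "DET": [1988, 1989],
    "HOU": [1993, 1994],
    "PHI": [1966, 1982],
}

# (team, season, wins) for the weak seasons quoted in the text.
# PHI 1987 is inferred as 1982 + 5.
WEAK_SEASONS = [
    ("DET", 1993, 20),
    ("HOU", 1998, 31),
    ("PHI", 1972, 9),
    ("PHI", 1987, 36),
    ("PHI", 1995, 18),
]


def years_since_last_title(title_years, season):
    """Years between a season and the most recent title at or before it."""
    earlier = [y for y in title_years if y <= season]
    if not earlier:
        return None  # no championship before this season
    return season - max(earlier)


gaps = [
    (team, season, wins, years_since_last_title(CHAMPIONSHIPS[team], season))
    for team, season, wins in WEAK_SEASONS
]
for team, season, wins, gap in gaps:
    print(f"{team} {season}: {wins} wins, {gap} years after last title")
```

Four of the five weak seasons fall within six years of the team's most recent
championship.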
Critique of Spotfire
Spotfire facilitates information visualization in many ways. First, it reveals useful
relationships in the data that are hard to observe at first glance. Second, it quickly
generates charts from a dataset and links several of them, allowing users to explore the
details interactively. Furthermore, Spotfire accepted the format of our dataset, so we did
not need to modify it.
Among its other advantages, the demonstration video is well prepared and is easy to reach
from Spotfire at the very beginning. It helps beginners grasp what they can do with
Spotfire.
Moreover, the user interface lets users switch an axis easily from one column of the
dataset to another, which speeds up the search for meaningful representations of the data.
Furthermore, thanks to flexible filtering functions, users can easily narrow down the data
to focus on a particularly interesting portion.
One of the powerful options in Spotfire is the set of aggregate functions available when
creating histograms. Custom expressions are supported for aggregate functions, but using
them is rather complicated and error-prone.
On the other hand, one disadvantage of Spotfire is that it is not designed for general use;
it is mainly intended for specialized use, so novice users can have problems understanding
it. For instance, we had difficulty figuring out what the presented charts mean. In
addition, some of its features are named inappropriately, which confuses users even more.
We had to go through the help topics to understand these misleadingly named features.
Another disadvantage is that Spotfire focuses on analyzing data, not on making a report.
One obstacle to making a report with Spotfire is the inflexible legend layout: Spotfire
does not let users place the legend on a graph, only show or hide it. It is also
inconvenient for report figures that users cannot set the names of the axes freely. In
addition, even though labels are important in report figures, users cannot change the
layout of the labels.
One other restriction concerns selection. When some data is selected on a figure, the
details are shown in the details-on-demand area. We expected that selecting an item in the
details-on-demand area would in turn change the view to show more detail, such as focusing
on a specific row. We wanted this function while using Spotfire, but it is not implemented.
Yet another limitation concerns data types. Spotfire recognizes the type of each column,
such as string, integer, and real. However, even when the type is integer, Spotfire by
default shows the axis labels of a graph with decimal fractions; they should be shown as
integers to avoid confusing users.
Finally, Spotfire lets users create multiple pages in a file; however, duplicated pages are
linked together. Users may copy a page to view the same figure with different settings, but
because of this linkage, changes such as filtering on one page are reflected on the other.
There should be an easy way to cut this link, or duplicated pages should not share their
settings at all.
Although the tool is mostly robust, we found some bugs while using it. One flaw concerns
deleted rows: after the data is loaded, Spotfire offers an option to delete rows that the
user wants to omit. However, when the chart is saved and loaded again, Spotfire reads in
the original data rather than the saved chart, so the previously omitted rows reappear. The
logical assumption when using such a tool is that the data is loaded once, the
visualization is created, and the visualization itself is saved, not the data.
Since we were using the tool for the first time, we may have missed some options that would
overcome the problems stated above. To a large extent, Spotfire simplified our task;
without it, it would have been hard to gain such insight into this data.
References
[1] http://82games.com/
[2] http://www.basketball-reference.com/
[3] http://www.databasebasketball.com/
[4] http://dberri.wordpress.com/
[5] http://www.nba.com/history/finals/champions.html
[6] http://www.wagesofwins.com/