baseball parallel coordinates step by step

DAI 523 Information Design 1: Data Visualization | Trogu | Fall 2012
baseball parallel coordinates – step by step
Wed. Nov. 14, 2012
PARALLEL COORDINATES
baseball - game percentage
top 15 teams in 2011 (from highest to lowest) and 2012
This exercise is based on the parallel coordinates plot described in Chapter 7: Spotting Differences (pp.
251-258) of the book Visualize This. However, it uses the plotrix package instead of the lattice package
described in the chapter (both are R packages with similar uses).
The final complete code can be found here and also at the end of this document.
Gather data from one of the many baseball sites. This exercise uses the ESPN site, with baseball standings as of
10/31/2011:
http://espn.go.com/mlb/standings/_/date/20111031
and 10/31/2012:
http://espn.go.com/mlb/standings/_/date/20121031
Simply select the tables, including the headers (American League, etc.), and paste them into Excel. You
will have to clean up the data in Excel: remove the x and other markers in front of the team names, and
remove all headers except the stats headers. Add a “team” label in the header of the team column.
In Excel, create a file with multiple worksheets so you can keep
your data organized. For this exercise, you will need the team
names, and the percentage number (PCT). Above 0.500 means
a team has won more games than it has lost. Copy and paste the
PCT columns from both 2011 and 2012 into a new worksheet.
In the new worksheet, I set the numbers to three decimal places (Format > Cells > Number > Decimal
places > 3) so they all look the same.
Sort the “percent_2011” column from highest to lowest. Make sure to expand the selection before
sorting, so that each team name and 2012 value stays with its 2011 value.
For the exercise, we’ll only use the top 15 teams, as shown above.
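The same sort-and-trim step can also be done in R instead of Excel. Here is a minimal sketch, assuming a small data frame called standings with column names team, percent_2011 and percent_2012 (these names are made up for illustration, and only five of the teams from this exercise are included to keep it short):

```r
# Hypothetical data frame; the object and column names are assumptions.
standings <- data.frame(
  team = c("Texas", "Philadelphia", "Toronto", "NY Yankees", "Milwaukee"),
  percent_2011 = c(0.593, 0.630, 0.500, 0.599, 0.593),
  percent_2012 = c(0.574, 0.500, 0.451, 0.586, 0.512),
  stringsAsFactors = FALSE
)

# Sort whole rows by the 2011 percentage, highest first.
# Reordering rows (not a single column) plays the role of
# "expand selection" in Excel: each team keeps its own numbers.
sorted <- standings[order(standings$percent_2011, decreasing = TRUE), ]

# Keep only the top teams (15 in the exercise, 3 in this short example).
top_teams <- head(sorted, 3)
print(top_teams)
```

Changing the 3 in head() to 15 would give the top 15 teams once the full table is loaded.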
Although we should be able to import the dataset directly into R, we will write the data directly into the
code instead. This will require two things:
1. The numbers from the percent columns need to be written out sequentially, without quotes, and also
separated by commas. Read numbers left to right and top to bottom, in sequence.
2. The team names need to be in quotes and separated by commas.
You can cut and paste from Excel into TextWrangler or Notepad++ and then use simple search-and-replace-all to get rid of spaces and returns, and to add quotes and commas. You should end up with something like this:
for number 1:
0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,0.586,0.543,
0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605
for number 2:
"Philadelphia","NY Yankees","Milwaukee","Texas","Detroit","Arizona","Tampa Bay",
"Boston","St. Louis","Atlanta","LA Angels","San Francisco","LA Dodgers",
"Toronto","Washington"
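If the values are already in R, the two comma-separated strings above can also be produced without search-and-replace. This is only a sketch; the variable names pct and teams are made up, and just the first few values from the exercise are used:

```r
# First four percentages and first two team names from the exercise.
pct <- c(0.630, 0.500, 0.599, 0.586)
teams <- c("Philadelphia", "NY Yankees")

# Numbers: three decimals each, no quotes, separated by commas.
numbers_text <- paste(sprintf("%.3f", pct), collapse = ",")

# Names: wrapped in double quotes and separated by commas.
names_text <- paste0('"', teams, '"', collapse = ",")

cat(numbers_text, "\n")
cat(names_text, "\n")
```

The sprintf("%.3f", ...) call keeps trailing zeros such as 0.500, which matches the three-decimal format set in Excel.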
Open RStudio and install the “plotrix” package. Use the
install packages button in the plot frame in the RStudio
environment, as shown at left. Make sure to “check” the
package after installation to activate it.
Start a new R script (top-left corner, as shown).
To load the data, paste this into the script:
# baseball - game percentage
# top 15 teams in 2011 (from highest to lowest) and 2012
# each pair of numbers is one team: 2011 value, then 2012 value
baseball_top15<-matrix(c(0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,
  0.586,0.543,0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
  0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605),
  ncol=2,byrow=TRUE)
rownames(baseball_top15)<-c("Philadelphia","NY Yankees","Milwaukee","Texas",
  "Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels",
  "San Francisco","LA Dodgers","Toronto","Washington")
Run the script. The script tells R to load the data as a matrix with two columns, filled row by row
(byrow=TRUE). The numbers are read in pairs; for example, the first pair, 0.630,0.500, is the 2011 and
2012 values for Philadelphia. The team names are assigned as the row names.
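To see why byrow=TRUE matters, here is a small sketch using only the first three teams. The object names values, by_row and by_col are made up for this illustration:

```r
# First three pairs from the exercise: 2011 value, then 2012 value.
values <- c(0.630, 0.500, 0.599, 0.586, 0.593, 0.512)

# byrow = TRUE fills the matrix one row (one team) at a time.
by_row <- matrix(values, ncol = 2, byrow = TRUE)
rownames(by_row) <- c("Philadelphia", "NY Yankees", "Milwaukee")

# The default (byrow = FALSE) fills column by column instead,
# which would scramble the 2011/2012 pairs.
by_col <- matrix(values, ncol = 2)

print(by_row)
print(by_col)
```

With byrow=TRUE the first row is 0.630, 0.500 (Philadelphia's two seasons). Without it, the first row would be 0.630, 0.586, mixing two different teams.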
colnames(baseball_top15)<-c(2011,2012)
The above names the two columns, which label the left axis (2011) and the right axis (2012) of the plot. Nothing is plotted yet.
Next write and run this:
bumpchart(baseball_top15,main="Baseball game percentages - Top 15 teams in 2011")
The bumpchart function in plotrix plots the data but simply ranks each team from highest to lowest on
both columns and connects the two points with a line. This is actually quite useful when all that is needed
is the comparison between the teams, not the actual percentages. R should plot the graph shown below:
Notice that some teams have the same percentage and share the same dot. However, following each
line to the corresponding team name on the right side clarifies which line is which.
This is great: teams, rankings, and labels, all with one line of code!
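To get a feel for what ranking each column means, here is a sketch using base R's rank() function on the first five teams. It only illustrates the idea of ranking and is not necessarily how bumpchart works internally; the object names top5 and ranks are made up:

```r
# First five teams from the exercise, 2011 and 2012 values.
top5 <- matrix(c(0.630, 0.500,
                 0.599, 0.586,
                 0.593, 0.512,
                 0.593, 0.574,
                 0.586, 0.543),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("Philadelphia", "NY Yankees", "Milwaukee",
                                 "Texas", "Detroit"),
                               c("2011", "2012")))

# Rank each column separately; rank() gives the lowest value rank 1.
# Tied values receive the same (averaged) rank.
ranks <- apply(top5, 2, rank)
print(ranks)
```

Milwaukee and Texas have the same 2011 percentage (0.593), so they get the same 2011 rank, which is why tied teams can end up on the same dot.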
With the next commands, we’ll plot the actual percentages and add a scale:
# now show the raw percentages and add ticks on left side (1.1) and right side (1.9)
bumpchart(baseball_top15,rank=FALSE,
main="Major league baseball - Game percentages - Top 15 teams 2011",col=rainbow(5))
The rank=FALSE plots the actual percentages instead of the rankings. The col=rainbow(5) colors the
lines with five rainbow colors (not random ones: five evenly spaced hues, so with 15 lines the colors
repeat), which we’ll modify later in Illustrator. Running the script changes the graph quite radically,
but we still need the scale, which we’ll add next.
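To look at the colors themselves, here is a short sketch. It assumes the five colors are simply reused in order across the 15 lines, as R graphics arguments usually are; the names line_cols and cols_for_15 are made up:

```r
# rainbow(5) returns five color codes as text, always the same five.
line_cols <- rainbow(5)
print(line_cols)

# Assumption: with 15 lines, the five colors are reused in order,
# so lines 1, 6 and 11 would share a color.
cols_for_15 <- rep_len(line_cols, 15)
print(cols_for_15)
```

Using rainbow(15) instead would give every team its own color, at the cost of colors that are harder to tell apart.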
# margins have been reset, so use
par(xpd=TRUE)
boxed.labels(1.1,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
boxed.labels(1.9,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
par(xpd=FALSE)
The par(xpd=TRUE) command lets R draw outside the plot region, so labels that fall beyond it are not cut
off; par(xpd=FALSE) turns this off again afterwards. The boxed.labels command adds the tick mark labels.
The 1.1 tells it to place the labels just to the right of the 2011 column on the left, and the 1.9 in the
second set of labels tells it to place the labels just to the left of the 2012 column on the right. Using 1
will put the labels exactly on the left column, and using 2 will put them exactly on the right column. The
by=0.020 tells it to space the labels every 0.020. See the screenshot below for the result:
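The seq() call that generates the label positions can be tried on its own. This sketch also formats the values as text with three decimals, which is an optional extra and not part of the exercise code; the names tick_values and tick_labels are made up:

```r
# Positions from 0.320 to 0.640 in steps of 0.020.
tick_values <- seq(0.320, 0.640, by = 0.020)
print(length(tick_values))

# Optional: format as text with three decimals (0.500 rather than 0.5).
tick_labels <- sprintf("%.3f", tick_values)
print(tick_labels)
```

Changing by=0.020 to by=0.040 would halve the number of labels.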
Next, export the plot as a PDF file and open it in Illustrator. Before you do anything else, make sure to
release any clipping mask, so you can edit the file more easily: Select All > Object > Clipping Mask >
Release.
This will allow you to select individual objects, such as the dots, which will likely have become squares
in the transfer. Remember the bug about the missing font? (You are missing Adobe Pi Std or something
like that.) Select all the squares (easier to do in “Preview mode”: View > Preview or Command-Y) and
change the font to Zapf Dingbats. This will make them dots (circles) again. Note that every dot is made
up of a separate fill and a separate border; you might want to clean that up too.
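The PDF can also be written from code instead of the export button. R's pdf() device has a useDingbats argument that controls whether small circles are drawn with a Dingbats font; setting it to FALSE may avoid the dots-turned-squares problem, although that has not been checked against this exercise. The sketch below uses a plain base R plot of one team so that it runs without plotrix, and the file name is made up:

```r
# Hypothetical output file in R's temporary folder.
out_file <- file.path(tempdir(), "baseball_points_test.pdf")

# Open a PDF device; useDingbats = FALSE draws circles as shapes,
# not as characters from a Dingbats font.
pdf(out_file, width = 7, height = 7, useDingbats = FALSE)

# Stand-in plot: Philadelphia's 2011 and 2012 values joined by a line.
plot(c(1, 2), c(0.630, 0.500), type = "b", pch = 19,
     xlab = "", ylab = "PCT", xaxt = "n")
axis(1, at = c(1, 2), labels = c("2011", "2012"))

# Close the device so the file is actually written.
dev.off()
print(file.exists(out_file))
```

In the exercise, the plot() and axis() lines would be replaced by the bumpchart and boxed.labels commands.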
In Illustrator, I deleted all unnecessary boxes. I moved the labels and scales outside the main rectangle,
which I made a very light gray. I made all lines except three a dark gray, as well as the dots. I highlighted
the best and the worst record from 2011 to 2012 (green=best, red=worst) and made San Francisco blue
and thicker. The rest was just fine-tuning. The result is below:
[Figure: the finished chart, “Major league baseball - Game percentages - Top 15 teams 2011”. Team names are listed on the left (2011) and on the right (2012), with a scale from 0.420 to 0.640 in steps of 0.020 on both sides.]
CODE FROM EXERCISE:
# this R code file is named: top15_teams_baseball_test_03.r
# load the plotrix package (it must already be installed)
library(plotrix)
# baseball - game percentage
# top 15 teams in 2011 (from highest to lowest) and 2012
baseball_top15<-matrix(c(0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,
  0.586,0.543,0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
  0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605),
  ncol=2,byrow=TRUE)
rownames(baseball_top15)<-c("Philadelphia","NY Yankees","Milwaukee","Texas",
  "Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels",
  "San Francisco","LA Dodgers","Toronto","Washington")
colnames(baseball_top15)<-c(2011,2012)
bumpchart(baseball_top15,main="Baseball game percentages - Top 15 teams in 2011")
# now show the raw percentages and add ticks on left side (1.1) and right side (1.9)
bumpchart(baseball_top15,rank=FALSE,
main="Major league baseball - Game percentages - Top 15 teams 2011",col=rainbow(5))
# margins have been reset, so use
par(xpd=TRUE)
boxed.labels(1.1,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
boxed.labels(1.9,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
par(xpd=FALSE)