baseball parallel coordinates step by step

DAI 523 Information Design 1: Data Visualization | Trogu | Fall 2012
baseball parallel coordinates – step by step
Wed. Nov. 14, 2012

PARALLEL COORDINATES
baseball - game percentage
top 15 teams in 2011 (from highest to lowest) and 2012

This exercise is based on the parallel coordinates plot described in Chapter 7: Spotting Differences (pp. 251-258) of the book Visualize This. However, it uses the plotrix package instead of the lattice package described in the chapter (both are R packages with similar uses). The final complete code can be found at the end of this document.

Gather data from one of the many baseball sites. In this case we use the ESPN site, with baseball standings as of 10/31/2011:
http://espn.go.com/mlb/standings/_/date/20111031
and 10/31/2012:
http://espn.go.com/mlb/standings/_/date/20121031

Simply select the tables, including the headers (American League, etc.), and paste them into Excel.

You will have to clean up the data in Excel: remove the x and other markers in front of team names, and remove all headers except the stats headers. Add a “team” label to the header of the team column.

In Excel, create a file with multiple worksheets so you can keep your data organized. For this exercise, you will need the team names and the percentage number (PCT). A PCT above 0.500 means a team has won more games than it has lost.

Copy and paste the PCT columns from both 2011 and 2012 into a new worksheet. In the new worksheet, I set the numbers to 3 decimal places (format > cells > number > decimal places > 3) to keep them looking the same. Sort the “percent_2011” column from highest to lowest, and make sure to expand the selection before sorting so that each row stays together. For the exercise, we’ll only use the top 15 teams, as shown above.

Although we should be able to import the dataset directly into R, we will write the data directly into the code instead. This requires two things:

1. The numbers from the percent columns need to be written out sequentially, without quotes, and separated by commas. Read the numbers left to right and top to bottom, in sequence.
2. The team names need to be in quotes and separated by commas.

You can cut and paste from Excel into TextWrangler or Notepad++ and then do a simple search and replace-all to get rid of spaces and returns and to add quotes and commas. You should end up with something like this:

for number 1:
0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,0.586,0.543,0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605

for number 2:
"Philadelphia","NY Yankees","Milwaukee","Texas","Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels","San Francisco","LA Dodgers","Toronto","Washington"

Open RStudio and install the “plotrix” package. Use the install packages button in the plot frame of the RStudio environment, as shown at left. Make sure to “check” the package after installation to activate it.

Start a new R script (top-left corner, as shown). To load the data, paste this into the script:

    # baseball - game percentage
    # top 15 teams in 2011 (from highest to lowest) and 2012
    baseball_top15<-matrix(c(0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,
     0.586,0.543,0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
     0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605),
     ncol=2,byrow=TRUE)
    rownames(baseball_top15)<-c("Philadelphia","NY Yankees","Milwaukee","Texas",
     "Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels",
     "San Francisco","LA Dodgers","Toronto","Washington")

Run the script. The script tells R to load the data as a matrix having rows arranged in two columns.
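As a quick check of how byrow=TRUE fills the matrix, here is a minimal sketch using only the first six numbers of the exercise data. The object name m is just an illustration and is not part of the exercise script.

```r
# First six numbers from the exercise data: three teams, two seasons each
pct <- c(0.630, 0.500, 0.599, 0.586, 0.593, 0.512)

# byrow=TRUE fills the matrix one row at a time, so each consecutive
# pair of numbers becomes one team's 2011 and 2012 values
m <- matrix(pct, ncol = 2, byrow = TRUE)
rownames(m) <- c("Philadelphia", "NY Yankees", "Milwaukee")
colnames(m) <- c(2011, 2012)

print(dim(m))              # number of rows (teams) and columns (seasons)
print(m["Philadelphia", ]) # the 2011 and 2012 values for one team
```

If the numbers had been typed column by column instead (all 2011 values first, then all 2012 values), byrow=FALSE would be the matching setting.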
The numbers are read as sets of two; for example, the first pair is 0.630,0.500 for Philadelphia. The names of the teams are assigned as the names of the rows.

    colnames(baseball_top15)<-c(2011,2012)

The above labels the Y axis on the left (2011) and on the right (2012). Nothing is plotted yet.

Next write and run this:

    bumpchart(baseball_top15,main="Baseball game percentages - Top 15 teams in 2011")

The bumpchart function in plotrix plots the data, but by default it simply ranks each team from highest to lowest in both columns and connects the two points with a line. This is actually quite useful when all that is needed is the comparison between the teams, not the actual percentages. R should plot the graph shown below:

Notice that some teams have the same percentages and share the same dot. However, following the lines to the right side to the corresponding team name clarifies which line is which. This is great: teams, rankings, and labels all with one line of code!

With the next commands, we’ll plot the actual percentages and add a scale:

    # now show the raw percentages and add ticks on left side (1.1) and right side (1.9)
    bumpchart(baseball_top15,rank=FALSE,
     main="Major league baseball - Game percentages - Top 15 teams 2011",col=rainbow(5))

The rank=FALSE plots the actual percentages instead of the ranks. The col=rainbow(5) colors the lines with a palette of 5 rainbow colors, reused in turn across the 15 teams, which we’ll modify later in Illustrator. Running the script changes the graph quite radically, but we still need the scale, which we’ll add next.

    # margins have been reset, so use
    par(xpd=TRUE)
    boxed.labels(1.1,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
    boxed.labels(1.9,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
    par(xpd=FALSE)

The par(xpd=TRUE) command allows drawing outside the plot region, and par(xpd=FALSE) restores the default clipping afterward. The boxed.labels command adds the tick mark labels. The 1.1 tells it to place the labels just to the right of the 2011 column on the left, and the 1.9 in the second set of labels tells it to place the labels just to the left of the 2012 column on the right. Using 1 will put the labels directly on the left column, and using 2 will put the labels directly on the right column. The by=0.020 tells it to space the labels every 0.020. See the screenshot below for the result:

Next, export the plot as a PDF file and open it in Illustrator. Before you do anything else, make sure to release any clipping mask so you can edit the file more easily: Select all > Object > Clipping Mask > Release. This will allow you to select individual objects, such as the dots, which will likely have become squares in the transfer. Remember the bug about the missing font? (You are missing Adobe Pi Std or something like that.) Select all the squares (easier to do in “Preview mode”: View > Preview or Command-Y) and change the font to Zapf Dingbats. This will make them dots (circles) again. Note that every dot is made up of a separate fill and a separate border; you might want to clean that up too.

In Illustrator, I deleted all unnecessary boxes. I moved the labels and scales outside the main rectangle, which I made a very light gray. I made all lines except three a dark gray, as well as the dots. I highlighted the best and the worst record from 2011 to 2012 (green=best, red=worst) and made San Francisco blue and thicker. The rest was just fine-tuning.
The result is below:

[Figure: final chart, "Major league baseball - Game percentages - Top 15 teams 2011", with team names labeled on the left (2011) and right (2012) and a percentage scale from 0.420 to 0.640 on both sides.]

CODE FROM EXERCISE:

    # this R code file is named: top15_teams_baseball_test_03.r
    # requires the plotrix package (bumpchart, boxed.labels)
    library(plotrix)

    # baseball - game percentage
    # top 15 teams in 2011 (from highest to lowest) and 2012
    baseball_top15<-matrix(c(0.630,0.500,0.599,0.586,0.593,0.512,0.593,0.574,
     0.586,0.543,0.580,0.500,0.562,0.556,0.556,0.426,0.556,0.543,0.549,0.580,
     0.531,0.549,0.531,0.580,0.509,0.531,0.500,0.451,0.497,0.605),
     ncol=2,byrow=TRUE)
    rownames(baseball_top15)<-c("Philadelphia","NY Yankees","Milwaukee","Texas",
     "Detroit","Arizona","Tampa Bay","Boston","St. Louis","Atlanta","LA Angels",
     "San Francisco","LA Dodgers","Toronto","Washington")
    colnames(baseball_top15)<-c(2011,2012)

    # ranked version: teams ordered from highest to lowest in each column
    bumpchart(baseball_top15,main="Baseball game percentages - Top 15 teams in 2011")

    # now show the raw percentages and add ticks on left side (1.1) and right side (1.9)
    bumpchart(baseball_top15,rank=FALSE,
     main="Major league baseball - Game percentages - Top 15 teams 2011",col=rainbow(5))

    # margins have been reset, so use
    par(xpd=TRUE)
    boxed.labels(1.1,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
    boxed.labels(1.9,seq(0.320,0.640,by=0.020),seq(0.320,0.640,by=0.020))
    par(xpd=FALSE)
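The exercise types the data into the script, but it also mentions that the dataset could be imported directly. A minimal sketch of that alternative follows, under stated assumptions: the cleaned worksheet has been saved as a CSV file with columns named team, percent_2011 and percent_2012 (only "team" and "percent_2011" appear in the exercise; the third name and the object names are assumed here). So that the sketch runs on its own, it first writes a three-row stand-in file to a temporary location.

```r
# Stand-in for the CSV exported from Excel (hypothetical file; the column
# name percent_2012 is assumed to mirror percent_2011)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "team,percent_2011,percent_2012",
  "NY Yankees,0.599,0.586",
  "Philadelphia,0.630,0.500",
  "Milwaukee,0.593,0.512"
), csv_path)

# Read the file, sort by the 2011 percentage (highest first), keep the top 15
standings <- read.csv(csv_path, stringsAsFactors = FALSE)
standings <- standings[order(-standings$percent_2011), ]
top <- head(standings, 15)

# Build the same kind of matrix the exercise types in by hand
baseball_from_csv <- as.matrix(top[, c("percent_2011", "percent_2012")])
rownames(baseball_from_csv) <- top$team
colnames(baseball_from_csv) <- c(2011, 2012)

print(baseball_from_csv)
```

With a real export, csv_path would point at the saved file instead, and the resulting matrix could be passed to bumpchart in place of baseball_top15.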
