Quick introduction to descriptive statistics and graphs in R Commander

Written by: Robin Beaumont
e-mail: robin@organplayers.co.uk
http://www.robin-beaumont.co.uk/virtualclassroom/stats/course1.html
Date last updated: Wednesday, 24 April 2013
Version: 2

Contents

Boxplots
Percentages for each category/factor level
Summaries for an interval/ratio variable divided across categories (factor levels)
Histograms
Density plots
Density plots for subgroups defined by factor levels
Graphical summaries of data - aggregation
Aggregating data

Boxplots

From within R you need to load R Commander by typing the following command:

library(Rcmdr)

First of all you need some data, and for this example I'll use the sample dataset, loading it directly from my website.
You can do this by selecting the R Commander menu option:

Data -> from text, the clipboard or URL

In the dialog box I have given the resulting dataframe the name mydataframe, and indicated that the data come from a URL (i.e. the web) and that the columns are separated by tab characters. Clicking on the OK button brings up the internet URL box, into which you need to type the following to obtain my sample data:

http://www.robin-beaumont.co.uk/virtualclassroom/stats/basics/coursework/data/pain_medication.dat

This dataset has 7 variables, of which we are only interested in two here: time (the outcome variable) and dosage, a grouping variable indicating which group the result ('time') belongs to.

[Figure: boxplots of time (y axis) for each dosage group (x axis: High, Low)]

Percentages for each category/factor level

Using the dataset from the boxplots example, we can take a single variable and obtain the counts and percentages for each category in R Commander. Suppose we wanted to know the number and percentage of cases in each group, that is, within each category (level) of the dosage variable. The dosage variable is a grouping variable (nominal data), and each value is said to represent a factor level.

Summaries for an interval/ratio variable divided across categories (factor levels)

We can obtain simple descriptive statistics using the menu option shown opposite. We can also find these for subgroups by using the Summarize by groups option.

Histograms

Say we wanted to see the distribution of ages in our dataset. You have three options (frequency counts, percentages or densities); usually you would only show one in a report.

[Figures: three histograms of mydataframe$age, with y axes showing frequency counts, percentages and density respectively]

Note the dataframe$column name format, i.e. mydataframe$age, used to label the x axis.

Density plots

A density plot is a smoothed version of a histogram and is very useful.
Unfortunately there is no R Commander menu option to produce them, so you need to type the command:

plot(density(dataframe name$column name))

So for our dataframe, which we have called mydataframe, and the column called age within it, we type:

plot(density(mydataframe$age))

[Figure: density plot titled density.default(x = mydataframe$age); x axis label: N = 200 Bandwidth = 3.239; y axis: Density]

Density plots for subgroups defined by factor levels

There are many ways to do this, and the easiest is to use the lattice package introduced later in the course. For now, considering just the gender variable, which has only 2 levels, we can do the following.

First copy only the male cases into a dataframe called maledata:

maledata <- mydataframe[mydataframe$gender == "Male",]

This selects only the rows where gender is "Male", together with all the columns in the dataframe. Note the double == meaning "is equal to"; the comma is important.

Now copy only the female cases into a dataframe called femaledata:

femaledata <- mydataframe[mydataframe$gender == "Female",]

Again this selects only the rows where gender is "Female" and all the columns; the double == and the comma are needed as before.

Now create our density plot:

plot(density(maledata$age), ylim = c(0, 0.07), main = "densityplots for males/females[dashed] for age", xlab = "age (years)")

This plots the density of the male ages, sets the y axis limits to 0 to 0.07, sets the main title of the graph, and sets the x axis label to read "age (years)".

Now we need to superimpose the female density line:

lines(density(femaledata$age), lty = 2)

This sets the line type to 2, which is a dashed line, to differentiate it from the default solid line type.

Graphical summaries of data - aggregation

Problem: we want to show hourly wage against years working at a health institution, and have the data in the following format. First obtain either the healthwagedata.sav or the healthwagedata.rda file from the URL below and store it on your local machine.
http://www.robin-beaumont.co.uk/virtualclassroom/book2data/healthwagedata.rda
or
http://www.robin-beaumont.co.uk/virtualclassroom/book2data/healthwagedata.sav

The top left screenshot shows how to load the rda file. We see (top right) that there are many entries for each value of yrsscale (time worked with the institution), while hourwage shows the average hourly wage.

Before we do anything, let's check what the summary values are for each level of employment time, using the menu option Statistics -> Summaries -> Numerical summaries and setting up the dialog box as shown opposite. Clearly the mean and median hourly rate go up with years of employment, from 18 to 21.63.

Because there are multiple hourly wage values for each level of employment time, a scatterplot of the raw data is not appropriate, but we have two options: produce a series of boxplots or means for each group, or aggregate the data, for example by finding the mean hourly wage at each level of employment time and then plotting these values.

We can easily produce a boxplot of the above findings. By selecting the identify outliers option "automatically" we have the case numbers marked.

[Figure: boxplots of hourwage (y axis) for each yrsscale level (x axis: 5 or less, 6-10, 11-15, 16-20, 21-35, 36 or more), with many outliers labelled by case number]

By selecting the identify outliers option we now have a clearer, but possibly less useful, graph.

[Figure: the same boxplots of hourwage by yrsscale, with only a few outliers labelled by case number]

Answering the question "what do the many outliers suggest?" requires knowledge of the context in which the data were collected. They might be miscoded values or a particular distinct subset of employees, such as consultants, and a definitive answer needs detailed knowledge of the environment from which the data were collected.
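For readers who prefer typing commands, the group summaries and boxplots described above can be sketched in plain R roughly as follows. This is a minimal sketch and not necessarily the commands R Commander itself generates. The small dataframe wagedemo and its values are made up for illustration; only the column names hourwage and yrsscale come from the dataset described above.

```r
# Made-up illustration data; only the column names match the real dataset.
wagedemo <- data.frame(
  yrsscale = factor(
    c("5 or less", "5 or less", "5 or less", "6-10", "6-10", "6-10"),
    levels = c("5 or less", "6-10")   # explicit factor level order
  ),
  hourwage = c(17, 18, 19, 19, 20, 24)
)

# Mean and median hourly wage at each level of employment time
groupmeans   <- tapply(wagedemo$hourwage, wagedemo$yrsscale, mean)
groupmedians <- tapply(wagedemo$hourwage, wagedemo$yrsscale, median)
print(groupmeans)
print(groupmedians)

# One boxplot of hourwage for each yrsscale level
boxplot(hourwage ~ yrsscale, data = wagedemo,
        xlab = "yrsscale", ylab = "hourwage")
```

With the real data you would replace wagedemo with the name you gave the loaded dataframe.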
Ignoring the outliers, and assuming that the data are normally distributed at each level of years of employment, we can produce a graph of the means at each level along with an indication of range.

Graphs -> Plot of means

Selecting the standard errors option, we can see the estimated accuracy of the mean for each group. I feel that presenting the data like this possibly does it a disservice, as it now appears very clean, giving no indication of those very low and high paid workers!

[Figure: Plot of Means; x axis: mydataset$yrsscale (5 or less, 6-10, 11-15, 16-20, 21-35, 36 or more); y axis: mean of mydataset$hourwage]

Notice that the x categories are in the correct order, but this is not always the case; the rda and sav files contained additional information specifying the factor level order. However, if we had used a plain text file (i.e. .dat or .txt) we would have needed to reorder the factor levels by using the R Commander menu option:

Data -> Manage variables in active data set -> Reorder factor levels

The alternative strategy is to produce a new dataframe which consists only of the summary values. To do this we first need to remove all those rows which have empty values for either the hourwage or yrsscale variables.

Data -> Active data set -> Remove cases with missing data

See opposite. I have called the new dataframe cleandataframe. Notice that the new dataframe is automatically loaded. The new dataframe has 89 fewer records.

Aggregating data

Aggregating data and creating new datasets from the aggregated values is a common occurrence with large datasets, and this scenario provides you with a good example. Having removed all the cases with missing data, we can now create a new dataframe with just the aggregated data (i.e. the means) by selecting the menu option and then setting up the dialog box as shown opposite. Notice that the new dataframe is automatically loaded. The new dataframe has 6 records. Clicking on the edit data set button we can edit the new dataframe.
When you have finished, make sure you close it by clicking on the X button at the top right hand side of the window.

The next stage is to produce a scatterplot of the means against years of employment. However, we can only do this when we have at least two interval/ratio variables in the dataframe; otherwise the R Commander scatterplot menu option is grayed out, which it would be if you tried with the current dataframe. This is easily fixed by changing the yrsscale variable from a factor to a numeric variable. Once again click on the edit data set button, this time selecting the top of the yrsscale column, and change the variable to numeric. When you have finished, make sure you close both the variable editor and the data editor windows with the X button.

Now we can produce the scatterplot. Set up the dialog box as shown opposite. The result is shown below, but I feel it is far less informative than the boxplots we created earlier.

[Figure: scatterplot of hourwage (18.0 to 21.5) and yrsscale (1 to 6)]
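The menu steps in this section (removing cases with missing data, aggregating to group means, changing yrsscale to numeric and plotting) have rough equivalents in base R, sketched below. This is only a sketch under stated assumptions: the dataframe rawdemo and its values are invented for illustration, the names cleandemo and aggdemo are hypothetical, and R Commander may generate different commands for the same steps.

```r
# Invented illustration data with some missing values.
rawdemo <- data.frame(
  yrsscale = factor(
    c("5 or less", "5 or less", "6-10", "6-10", NA, "11-15", "11-15"),
    levels = c("5 or less", "6-10", "11-15")
  ),
  hourwage = c(17, 19, 20, NA, 21, 21, 23)
)

# Keep only the rows that have no missing values
cleandemo <- na.omit(rawdemo)

# One row per yrsscale level, holding the mean hourwage for that level
aggdemo <- aggregate(hourwage ~ yrsscale, data = cleandemo, FUN = mean)

# Change the factor into its integer level codes (1, 2, 3, ...)
aggdemo$yrsscale <- as.numeric(aggdemo$yrsscale)

# Scatterplot of the group means against the level codes
plot(aggdemo$yrsscale, aggdemo$hourwage,
     xlab = "yrsscale", ylab = "hourwage")
```

Converting a factor with as.numeric gives the level codes in level order, which is why the yrsscale axis on the scatterplot runs from 1 to 6 rather than showing the category labels.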