Aggregating data

Quick introduction to descriptive statistics and graphs in R Commander
Written by: Robin Beaumont e-mail: robin@organplayers.co.uk
http://www.robin-beaumont.co.uk/virtualclassroom/stats/course1.html
Date last updated Wednesday, 24 April 2013
Version: 2
Contents
Boxplots
Percentages for each category/factor level
Summaries for an interval/ratio variable divided across categories (factor levels)
Histograms
Density plots
Densityplots for subgroups defined by factor levels
Graphical summaries of data - aggregation
Aggregating data
Boxplots
From within R, load R Commander by typing the following command:
library(Rcmdr)
First of all you need some data, and for this example I'll use the sample dataset, loading it directly from my website.
You can do this by selecting the R Commander menu option:
Data-> from text, the clipboard or URL
In the dialog box I have given the resulting dataframe the name mydataframe, and indicated that it comes from a URL (i.e. the web) and that the columns are separated by tab characters.
Clicking on the OK button brings up the internet URL box, in which you need to type the following to obtain my sample data:
http://www.robin-beaumont.co.uk/virtualclassroom/stats/basics/coursework/data/pain_medication.dat
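The same import can be sketched at the R prompt with read.table(). This is a minimal sketch and not necessarily the command R Commander generates: the inline text below is made-up data standing in for the real file, and header = TRUE is an assumption that the first line of the file holds the column names.

```r
# Made-up, tab-separated sample standing in for the real file
# (the values are illustrative only, not taken from pain_medication.dat).
sample_text <- "time\tdosage\n4.5\tLow\n7.2\tHigh\n5.1\tLow\n9.8\tHigh\n"

# sep = "\t" says the columns are separated by tab characters.
# header = TRUE is an assumption: the first line holds the column names.
mydataframe <- read.table(text = sample_text, header = TRUE, sep = "\t",
                          stringsAsFactors = TRUE)

str(mydataframe)   # lists each variable and its type
```

To read the real file, the text = sample_text argument would be replaced by the URL above, given as the first argument in quotes.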
This dataset has 7 variables, of which we are only interested in two here: time (the outcome variable) and dosage, a grouping variable indicating which group the result ('time') belongs to.
[Figure: boxplots of time (y axis) for each dosage group (x axis: High, Low)]
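For readers who prefer typing commands, a boxplot of this kind can be sketched with base R's boxplot() function. The data frame below is made up for illustration; only the column names time and dosage come from the text above.

```r
# Made-up values standing in for the pain medication data.
mydataframe <- data.frame(
  time   = c(3.1, 4.0, 5.2, 6.4, 7.5, 8.1, 9.3, 11.0),
  dosage = factor(c("Low", "Low", "Low", "Low", "High", "High", "High", "High"))
)

# The formula reads "time split by dosage": outcome on the left of ~,
# grouping variable on the right. One box is drawn per factor level.
boxplot(time ~ dosage, data = mydataframe, xlab = "dosage", ylab = "time")
```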
Percentages for each category/factor level
Using the dataset from the boxplots example, we can take a single variable and obtain the count and percentage for each category in R Commander.
Suppose we want to know the number and percentage of cases in each group, that is, within each category (level) of the dosage variable.
The dosage variable is a grouping variable (nominal data), and each of its values is said to represent a factor level.
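As a rough command-line equivalent, counts and percentages per factor level can be obtained with table() and prop.table(). The dosage values below are invented for the sketch.

```r
# Made-up grouping variable with two factor levels.
dosage <- factor(c("Low", "High", "Low", "Low", "High"))

counts <- table(dosage)                             # number of cases per level
percentages <- round(100 * prop.table(counts), 1)   # share of cases per level, in percent

counts
percentages
```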
Summaries for an interval/ratio variable divided across categories (factor levels)
We can obtain simple descriptive statistics using the menu option shown opposite. We can also find these for subgroups by using the Summarize by groups option.
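Summaries by group can also be sketched with base R's tapply(), which applies a function to one variable separately within each level of a factor. The data are made up, and this is one possible approach rather than the command R Commander itself issues.

```r
# Made-up data: an interval/ratio variable (time) and a grouping factor (dosage).
mydataframe <- data.frame(
  time   = c(4, 5, 6, 8, 9, 10),
  dosage = factor(c("Low", "Low", "Low", "High", "High", "High"))
)

# Mean and standard deviation of time within each dosage level.
group_means <- tapply(mydataframe$time, mydataframe$dosage, mean)
group_sds   <- tapply(mydataframe$time, mydataframe$dosage, sd)

group_means
group_sds

# summary() gives minimum, quartiles, median, mean and maximum per group.
tapply(mydataframe$time, mydataframe$dosage, summary)
```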
Histograms
Say we wanted to see the distribution of ages in our dataset. You have three options; usually you would only show one in a report.
[Figure: three histograms of mydataframe$age. Frequency counts: y axis frequency. Percentages: y axis percent. Density histogram: y axis density.]
Note that the x axis is labelled using the dataframe dollar column name format, i.e. mydataframe$age.
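The three kinds of histogram can be sketched in base R as follows. The ages are simulated for illustration, and the percentage version rescales the counts by hand, which is one way of doing it and not necessarily how R Commander does.

```r
set.seed(1)
# Simulated ages standing in for mydataframe$age (not the real data).
mydataframe <- data.frame(age = round(rnorm(200, mean = 50, sd = 10)))

# 1. Frequency counts: bar height is the number of cases in each interval.
hist(mydataframe$age, main = "Frequency counts")

# 2. Percentages: compute the histogram without drawing it,
#    rescale the counts so that they sum to 100, then plot.
h <- hist(mydataframe$age, plot = FALSE)
h$counts <- 100 * h$counts / sum(h$counts)
plot(h, freq = TRUE, ylab = "percent", main = "Percentages")

# 3. Density histogram: the bar areas sum to 1.
hist(mydataframe$age, freq = FALSE, main = "Density histogram")
```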
Density plots
A density plot is a smoothed version of a histogram and is very useful. Unfortunately there is no R Commander menu option to produce one, so you need to type the command:
plot(density(dataframe name $ column name))
So for our dataframe, which we have called mydataframe, and the column called age within it, we type:
plot(density(mydataframe$age))
[Figure: density plot titled density.default(x = mydataframe$age); y axis: Density; x axis label: N = 200 Bandwidth = 3.239]
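The N and Bandwidth values printed under the x axis come from the object that density() returns. A small sketch with simulated ages (so the bandwidth will differ from the 3.239 reported for the real data) shows where they are stored:

```r
set.seed(1)
# Simulated ages standing in for mydataframe$age (not the real data).
mydataframe <- data.frame(age = rnorm(200, mean = 50, sd = 12))

d <- density(mydataframe$age)   # compute the smoothed curve without plotting

d$n    # number of observations used, shown as N on the plot
d$bw   # smoothing bandwidth, shown as Bandwidth on the plot

plot(d)   # draws the same graph as plot(density(mydataframe$age))
```

If the column contains missing values, density() stops with an error unless na.rm = TRUE is added to the call.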
Densityplots for subgroups defined by factor levels
There are many ways to do this, and the easiest is to use the lattice package introduced later in the course, but for now, just considering the gender variable, which has only 2 levels, we can do the following:
First copy only the male cases into a dataframe called maledata:
maledata <- mydataframe[mydataframe$gender == "Male",]
In this command, mydataframe$gender == "Male" selects only the rows where gender is Male; note the double == to mean "is equal to". The comma is important: leaving the space after it empty selects all the columns in the dataframe.
Now copy only the female cases into a dataframe called femaledata:
femaledata <- mydataframe[mydataframe$gender == "Female",]
The same notes apply: the double == selects only the rows where gender is Female, and the comma followed by nothing keeps all the columns.
Now create our densityplot. First plot the density of the male ages:
plot(density(maledata$age), ylim = c(0, 0.07), main = "densityplots for males/females[dashed] for age", xlab= "age (years)" )
In this command, ylim = c(0, 0.07) sets the y axis limits to 0 to 0.07, main sets the main title of the graph, and xlab sets the x axis label.
Now we need to superimpose the female density line, setting the line type to 2, which R draws as a dashed line, to differentiate it from the default solid line type:
lines(density(femaledata$age), lty = 2)
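The whole sequence can be tried end to end on simulated data. In this sketch the ages and group sizes are invented, and the y axis limit is computed from the two curves instead of being fixed at 0.07, which is a small variation on the commands above so that neither curve is clipped.

```r
set.seed(2)
# Simulated data standing in for mydataframe (not the real values).
mydataframe <- data.frame(
  age    = c(rnorm(100, mean = 45, sd = 8), rnorm(100, mean = 55, sd = 10)),
  gender = factor(rep(c("Male", "Female"), each = 100))
)

# Rows where gender matches; the empty slot after the comma keeps all columns.
maledata   <- mydataframe[mydataframe$gender == "Male", ]
femaledata <- mydataframe[mydataframe$gender == "Female", ]

dens_male   <- density(maledata$age)
dens_female <- density(femaledata$age)

# Tallest point of either curve, so both fit inside the plot.
ymax <- max(dens_male$y, dens_female$y)

plot(dens_male, ylim = c(0, ymax),
     main = "densityplots for males/females[dashed] for age",
     xlab = "age (years)")
lines(dens_female, lty = 2)   # lty = 2 is a dashed line
legend("topright", legend = c("Male", "Female"), lty = c(1, 2))
```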
Graphical summaries of data - aggregation
Problem: we want to show hourly wage against years working at a health institution and have the data in the
following format.
First obtain either the healthwagedata.sav or the healthwagedata.rda file from one of the URLs below and store it on your local machine.
http://www.robin-beaumont.co.uk/virtualclassroom/book2data/healthwagedata.rda
or
http://www.robin-beaumont.co.uk/virtualclassroom/book2data/healthwagedata.sav
The top left screenshot shows how to load the rda file.
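Outside the menus, an rda file is read with load(). The sketch below is self-contained: it first writes a tiny made-up data frame to a temporary rda file and then loads it back. The object name mydataset is an assumption taken from the axis labels of the plots later in this section; the real file may store its data under another name.

```r
# Made-up stand-in for the contents of healthwagedata.rda.
mydataset <- data.frame(
  yrsscale = factor(c("5 or less", "6-10")),
  hourwage = c(18.2, 19.1)
)

rda_path <- tempfile(fileext = ".rda")   # temporary file for the sketch
save(mydataset, file = rda_path)
rm(mydataset)                            # forget the object, as in a fresh session

# load() restores the saved objects and returns their names.
loaded_names <- load(rda_path)
loaded_names
str(get(loaded_names[1]))
```

With the real file, rda_path would be the location where healthwagedata.rda was stored on your machine.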
We see there are many entries for each yrsscale value (time worked with the institution), while hourwage shows the average hourly wage (top right).
Before we do anything let's check what the summary values are for each level of employment time, using the menu option statistics -> summaries -> numeric summaries and setting up the dialog box as shown opposite.
Clearly the mean and median hourly rate go up with years of employment, from 18 to 21.63.
Because there are multiple hourly wage values for each level of employment time, a scatter plot of the raw data is not appropriate, but we have two options:
produce a series of boxplots or means for each group
or
aggregate the data, for example find the mean hourly wage at each level of employment time, and then plot these values.
We can easily produce a boxplot of the above findings.
By selecting the identify outliers option: automatically we have the case numbers marked.
[Figure: boxplots of hourwage (y axis) for each yrsscale level (x axis: 5 or less, 6-10, 11-15, 16-20, 21-35, 36 or more), with the outliers labelled by case number]
By selecting the identify outliers option we now have a clearer, but possibly less useful graph.
[Figure: the same boxplots of hourwage by yrsscale, with only a few outliers labelled by case number]
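The case numbers of the outliers can also be listed without a graph. This sketch uses base R's boxplot() with plot = FALSE, which returns the points it would have drawn as outliers; the data are made up, and R Commander's own dialog may identify outliers by a different route.

```r
# Made-up wages in two employment-time groups; 40 is an obvious outlier.
mydataset <- data.frame(
  yrsscale = factor(rep(c("5 or less", "6-10"), times = c(6, 5)),
                    levels = c("5 or less", "6-10")),
  hourwage = c(10, 11, 12, 13, 14, 40, 20, 21, 22, 23, 24)
)

# plot = FALSE returns the boxplot statistics instead of drawing them.
b <- boxplot(hourwage ~ yrsscale, data = mydataset, plot = FALSE)

b$out     # the outlying values
b$group   # which box (factor level number) each outlier belongs to

# Match each (group, value) pair back to its row, i.e. its case number.
outlier_rows <- which(
  paste(as.integer(mydataset$yrsscale), mydataset$hourwage) %in%
    paste(b$group, b$out)
)
outlier_rows
```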
Asking what the many outliers suggest would require knowledge of the context in which the data was collected. They might be miscoded values or a particular distinct subset of employees such as consultants; a definitive answer needs detailed knowledge of the environment from which the data was collected.
Ignoring the outliers and assuming that the data are normally distributed at each level of number of years' employment, we can produce a graph of the means at each level along with an indication of range.
Graphs->plot of means
Selecting the standard errors option we can see the estimated accuracy of the mean for each group.
I feel that presenting the data like this possibly does it a disservice, as it now appears very clean, giving no indication of those very low and high paid workers!
[Figure: Plot of Means; y axis: mean of mydataset$hourwage; x axis: mydataset$yrsscale (5 or less, 6-10, 11-15, 16-20, 21-35, 36 or more)]
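A plot of means with standard error bars can be approximated in base R as sketched below. The wages are made up, only three of the six levels are used to keep the example short, and the standard error is computed as the standard deviation divided by the square root of the group size.

```r
# Made-up data with three employment-time levels in their natural order.
lev <- c("5 or less", "6-10", "11-15")
mydataset <- data.frame(
  yrsscale = factor(rep(lev, each = 4), levels = lev),
  hourwage = c(16, 18, 19, 19,  18, 19, 20, 21,  19, 20, 21, 22)
)

# Mean and standard error of hourwage within each level.
means <- tapply(mydataset$hourwage, mydataset$yrsscale, mean)
ses   <- tapply(mydataset$hourwage, mydataset$yrsscale,
                function(x) sd(x) / sqrt(length(x)))

x <- seq_along(means)
plot(x, means, type = "b", xaxt = "n",
     ylim = range(means - ses, means + ses),
     xlab = "yrsscale", ylab = "mean of hourwage", main = "Plot of Means")
axis(1, at = x, labels = names(means))   # label the x axis with the level names

# Error bars one standard error either side of each mean.
arrows(x, means - ses, x, means + ses, angle = 90, code = 3, length = 0.05)
```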
Notice that the x categories are in the correct order, but this is not always the case. The rda and sav files contained additional information specifying the factor level order. However, if we had used a plain text file (i.e. .dat or .txt) we would have needed to reorder the factor levels by using the R Commander menu option:
Data ->Manage variables in active dataset->Reorder factor levels
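The effect of reordering can be sketched at the command line with factor() and its levels argument. The level names are those shown on the plots above; the vector itself is made up.

```r
# A factor created from plain text: R sorts the levels as text,
# so they do not come out in order of employment time.
yrs <- factor(c("6-10", "5 or less", "36 or more", "11-15", "16-20", "21-35"))
levels(yrs)

# Re-create the factor with the levels listed in the order we want.
wanted <- c("5 or less", "6-10", "11-15", "16-20", "21-35", "36 or more")
yrs <- factor(yrs, levels = wanted)
levels(yrs)
```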
The alternative strategy is to produce a new dataframe which consists only of the summary values.
To do this we first need to remove all those rows which have empty values for either the hourwage or yrsscale variables.
data->active data set->remove cases with missing data
See opposite. I have called the new dataframe cleandataframe.
Notice that the new dataframe is automatically loaded.
The new dataframe has 89 fewer records.
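Removing incomplete rows can be sketched with na.omit(), which drops every row containing a missing value in the columns it is given. The data are made up, so the number of rows removed here is not the 89 found in the real dataset.

```r
# Made-up data with one missing yrsscale and one missing hourwage.
mydataset <- data.frame(
  yrsscale = factor(c("5 or less", NA, "6-10", "6-10")),
  hourwage = c(18, 20, NA, 19.5)
)

# Keep only the two variables of interest, then drop incomplete rows.
cleandataframe <- na.omit(mydataset[, c("yrsscale", "hourwage")])

nrow(mydataset) - nrow(cleandataframe)   # number of records removed
```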
Aggregating data
Aggregating data and creating new datasets from the aggregated values is a common task with large datasets, and this scenario provides you with a good example.
Having removed all the cases with missing data we can now create a new dataframe with just the aggregated data (i.e. the means) by selecting the menu option:
Then set up the dialog box as shown opposite.
Notice that the new dataframe is automatically loaded.
The new dataframe has 6 records.
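In base R the same kind of summary dataframe can be built with aggregate(). This is a sketch on made-up data with three levels instead of six; the name aggdata is hypothetical and chosen only for this example.

```r
# Made-up cleaned data: two wages per employment-time level.
lev <- c("5 or less", "6-10", "11-15")
cleandataframe <- data.frame(
  yrsscale = factor(rep(lev, each = 2), levels = lev),
  hourwage = c(17, 19,  19, 20,  20, 22)
)

# One row per yrsscale level, holding the mean hourwage for that level.
aggdata <- aggregate(hourwage ~ yrsscale, data = cleandataframe, FUN = mean)
aggdata
```

The result has one record per factor level, which is why the aggregated version of the real data has 6 records.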
Clicking on the edit data set button we can edit the new
dataframe.
When you have finished make sure you close it by clicking on the
X button on the top right hand side of the window.
The next stage is to produce a scatterplot of the means against year. However, we can only do this when we have at least two interval/ratio variables in the dataframe; otherwise the R Commander scatterplot menu option is grayed out, which it would be if you tried with the current dataframe. This is easily fixed by changing the yrsscale variable from a factor to a numeric variable.
Once again click on the edit data set button, this time selecting the top of the yrsscale column, and change the variable to numeric.
When you have finished make sure you close both the variable editor and the data editor windows with the X button.
Now we can produce the scatterplot.
Set up the dialog box as shown opposite.
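The conversion and the scatterplot can be sketched in a few lines. Converting a factor with as.numeric() gives the position of each level (1, 2, 3, ...), not a number of years. The dataframe below is made up and its name aggdata is hypothetical.

```r
# Made-up aggregated data: one mean wage per employment-time level.
lev <- c("5 or less", "6-10", "11-15")
aggdata <- data.frame(
  yrsscale = factor(lev, levels = lev),
  hourwage = c(18, 19.5, 21)
)

# Replace the factor by its level codes: 1 for the first level, 2 for the second, ...
aggdata$yrsscale <- as.numeric(aggdata$yrsscale)
aggdata$yrsscale

# Both variables are now numeric, so a scatterplot can be drawn.
plot(hourwage ~ yrsscale, data = aggdata)
```

Because the codes are only positions, the spacing on the x axis is equal between levels even though the year bands differ in width.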
The result is shown below, but I feel it is far less informative than the boxplots we created earlier.
[Figure: scatterplot of hourwage (y axis, 18.0 to 21.5) against yrsscale (x axis, 1 to 6)]