Programming in R
Describing Univariate and
Multivariate data
Describing univariate data
In this session I will explain:
• Measures of central tendency and
variation
• How to use figures to summarize a
single variable (univariate data)
• How to create these in R.
Characteristics of numeric variables
• Center, or where do we find most of the
data
• Distribution or shape, such as a bell
shaped curve
• Variation or dispersion: how spread
out are the data? On average, how far are
observations from the center?
• Outliers…have we got Bill Gates in our
salary sample?
Measure of central tendency
The “center” of a data set can be
described using two different measures:
1. Mean – the commonly known
“average”
2. Median – the midpoint
The mean
• The sample mean is sometimes
called “x bar”
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• Translation, add up all the values
and divide by the number of values
• Usually, this is what people call the
average
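As a small illustration in R, the sketch below computes a mean by hand and with the built-in mean(). The values and the name `prices` are made up for this example and do not come from the slides.

```r
# Hypothetical sample of five prices (not from the slides)
prices <- c(10, 12, 15, 18, 20)

# Add up all the values and divide by the number of values
manual_mean <- sum(prices) / length(prices)

# Built-in equivalent
builtin_mean <- mean(prices)

print(manual_mean)
print(builtin_mean)
```

Both approaches should give the same number; in practice mean() is the usual choice.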
The median
• The middle of the data is called the
median
– Sort the data from smallest to largest
– If there are an odd number of
observations, the middle number is the
median
– For even number of observations, the
median is the midpoint between the
two middle numbers
Median price = (7521 + 8139)/2 = 7830
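The odd and even cases can be sketched in R with median(). The two vectors below are hypothetical; only the two middle prices (7521 and 8139) are taken from the slide's example.

```r
# Hypothetical prices; only 7521 and 8139 come from the slide
odd_prices  <- c(7000, 7521, 8139)        # odd number of observations
even_prices <- c(7000, 7521, 8139, 9000)  # even number of observations

# median() sorts the data internally before picking the middle
median_odd  <- median(odd_prices)   # the middle value
median_even <- median(even_prices)  # midpoint of the two middle values

print(median_odd)
print(median_even)
```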
The mode
• The most commonly occurring
value
– There can be more than one mode
(multimodal, bimodal)
– Sometimes there is no mode
– For nominal (unordered) categorical
variables, the mode is the only possible
measure of central tendency
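Base R has no built-in function for the statistical mode (the function mode() reports an object's storage type instead). One possible approach, sketched here with made-up data and the hypothetical name `cuts`, is to tabulate the values and keep every value tied for the highest count:

```r
# Hypothetical categorical data (not from the slides)
cuts <- c("Ideal", "Good", "Ideal", "Fair", "Good", "Ideal")

# Frequency of each distinct value
counts <- table(cuts)

# Keep every value tied for the highest count,
# so bimodal or multimodal data return more than one mode
modes <- names(counts)[counts == max(counts)]

print(counts)
print(modes)
```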
Shape and skewness
• The median and the mode for table are
both 62, while the mean is 61.
• Table may be a fairly symmetrical
variable, with a slight left skew
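Comparing the mean with the median gives a rough first hint about skew, since the mean is pulled toward a long tail. The sketch below uses made-up values (the vector `x` is hypothetical, not the diamond data) with one low value dragging the mean below the median:

```r
# Hypothetical values with a tail on the low (left) side
x <- c(50, 58, 60, 61, 62, 62, 62, 63, 64)

x_mean   <- mean(x)
x_median <- median(x)

# A mean below the median is consistent with a left skew;
# this is a rule of thumb, not a formal test
print(x_mean)
print(x_median)
print(x_mean < x_median)
```

A histogram is still the safer way to judge shape.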
Normal variables and
standard deviation
• In a symmetric, bell shaped
distribution, we are able to describe
the entire distribution using only two
numbers, the mean and the
standard deviation
• The standard deviation is roughly
the average distance that
observations are from their mean
Calculating the standard
deviation
 X  x 
2
Standard deviation=
i
n 1
Translation: Find the difference between the
mean and each value in the dataset, square
each difference, add these up, divide by the
total number of values minus 1, then take the
square root of that (or, get R to do it for you)
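The steps in the translation can be followed one at a time in R and checked against the built-in sd(). The data vector `x` is made up for illustration.

```r
# Hypothetical data (not from the slides)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Difference between each value and the mean
deviations <- x - mean(x)

# Square each difference, add them up, divide by n - 1, take the square root
manual_sd <- sqrt(sum(deviations^2) / (length(x) - 1))

# Getting R to do it for you
builtin_sd <- sd(x)

print(manual_sd)
print(builtin_sd)
```

sd() uses the same n - 1 divisor, so the two results should agree.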
And we care because?
The Empirical Rule
For any normal curve, approximately:
• 68% of the values fall within 1 standard
deviation of the mean
• 95% of the values fall within 2 standard
deviations of the mean
• 99.7% of the values fall within 3 standard
deviations of the mean
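These percentages can be checked against the standard normal distribution with pnorm(), which returns the area under the curve to the left of a value. The helper name `within_sd` is introduced here only for illustration.

```r
# Share of a normal distribution within k standard deviations of the mean
# (hypothetical helper name, not a base R function)
within_sd <- function(k) pnorm(k) - pnorm(-k)

# Evaluate for 1, 2, and 3 standard deviations
shares <- within_sd(1:3)

print(round(100 * shares, 1))
```

The results should come out close to the rounded 68%, 95%, and 99.7% figures.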
Other things to describe
• How many modes?
• The range, minimum and maximum
Eruption times of Old Faithful geyser in
Yellowstone National Park, 1997 (n = 107)
[Histogram: number of eruptions (y-axis) by time of eruption in minutes (x-axis)]
This histogram shows a bimodal shape.
The data has a minimum of 1.67 minutes
and a maximum of 4.93 minutes, for a
range of 3.26 minutes.
http://wps.aw.com/wps/media/objects/15/15719/projects/ch3_faithful/index.html
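A similar histogram can be sketched in R with the built-in `faithful` dataset. Note the assumption: `faithful` is a different, larger sample of Old Faithful eruptions than the 1997 data (n = 107) on the slide, so its minimum, maximum, and bar heights will not match the figure.

```r
# Built-in dataset; NOT the 1997 sample shown on the slide
eruptions <- faithful$eruptions

# Minimum and maximum, and the range as their difference
min_max <- range(eruptions)
spread  <- diff(min_max)

print(min_max)
print(spread)

# Histogram; the number of breaks is a judgment call
hist(eruptions,
     breaks = 20,
     main = "Eruption times of Old Faithful (built-in faithful data)",
     xlab = "Time of eruption in minutes",
     ylab = "# of eruptions")
```

Note that range() returns the minimum and maximum as a pair, so diff() is needed to get the range as a single number.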
The five number summary
• Minimum, maximum, median, lower
quartile and upper quartile
[Box plot diagram with labels: Minimum, Lower Quartile, Median, Upper Quartile, Maximum]
The visual representation of the five
number summary is the box plot, also
called the box and whiskers plot
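In R, fivenum() returns the five number summary directly, and boxplot() draws it. The sleep values below are made up for illustration (the name `sleep_hours` is hypothetical).

```r
# Hypothetical hours of sleep for ten students (not the slide's data)
sleep_hours <- c(4, 5, 6, 6.5, 7, 7, 7.5, 8, 9, 16)

# Minimum, lower quartile (hinge), median, upper quartile (hinge), maximum
five <- fivenum(sleep_hours)
print(five)

# Box and whiskers plot of the same data
boxplot(sleep_hours, horizontal = TRUE, xlab = "Hours of sleep")
```

fivenum() uses hinges, so its quartiles can differ slightly from the ones summary() or quantile() report.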
Interpreting box plots
¼ of students slept between
3 and 6 hours,
¼ slept between 6 and 7,
¼ slept between 7 and 8
¼ slept between 8 and 16
Outlier: any value more than 1.5 times the
interquartile range (IQR) beyond the
closest quartile, shown with stars.
Copyright ©2005 Brooks/Cole, a division of Thomson Learning, Inc.
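The 1.5 × IQR rule can be applied by hand in R. This is a sketch with made-up sleep data (the name `sleep_hours` is hypothetical); it uses the default quartile definition of quantile(), which is an assumption, since textbooks and software differ slightly in how quartiles are computed.

```r
# Hypothetical hours of sleep for ten students (not the slide's data)
sleep_hours <- c(4, 5, 6, 6.5, 7, 7, 7.5, 8, 9, 16)

# Lower and upper quartiles (R's default quantile definition)
q   <- quantile(sleep_hours, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Values beyond these fences count as outliers under the rule
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- sleep_hours[sleep_hours < lower_fence | sleep_hours > upper_fence]
print(outliers)
```

boxplot() applies a closely related rule when it decides which points to draw separately from the whiskers.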
Other ways to visualize data
• When developing a visual
representation of a single variable,
the most common tools are histograms,
pie charts, bar charts, box plots, and
stem and leaf plots.
• We’ve already seen a histogram
and a box plot
Pie charts
• Excellent for categorical variables
with 5 or fewer categories.
[Pie chart: Diamond Sales in Millions for 30 NYC Jewelers in 2001. Slices by quarter (1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr), labeled 1.2 (9%), 1.4 (10%), 3.2 (23%), and 8.2 (58%)]
Bar charts
• Can be used to illustrate categories, or
means and medians by categories
[Bar chart: Clarity of 30 diamonds sold by an Atlanta Jeweler, 2011. x-axis: clarity ranking (Two, Three, Four, Five); y-axis: number of diamonds; bars labeled 11, 8, 6, and 5]
How to produce these in R
• Use summary() to get the minimum, first
quartile, median, mean, third quartile,
and maximum.
• Use table() to get frequency counts
• Use prop.table() on a table to get
proportions (multiply by 100 for percentages)
• Use pie(), barplot(), hist(), and
boxplot() to get pie charts, bar plots,
histograms, and box plots,
respectively.
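A minimal sketch putting these functions together is shown below. The data are made up for illustration: `clarity` and `price` are hypothetical variables, not the diamond data from the slides.

```r
# Hypothetical data: clarity ranking and price for eight diamonds
clarity <- factor(c("Two", "Three", "Three", "Four", "Five", "Three", "Four", "Two"),
                  levels = c("Two", "Three", "Four", "Five"))
price <- c(7521, 8139, 6200, 9100, 5400, 7800, 8800, 6900)

# Numeric variable: minimum, quartiles, median, mean, maximum
print(summary(price))

# Categorical variable: frequency counts, then proportions and percentages
counts <- table(clarity)
props  <- prop.table(counts)
print(counts)
print(round(100 * props, 1))

# Plots for the categorical variable
pie(counts, main = "Clarity ranking")
barplot(counts, xlab = "Clarity ranking", ylab = "Number of diamonds")

# Plots for the numeric variable
hist(price, main = "Price", xlab = "Price")
boxplot(price, horizontal = TRUE, xlab = "Price")
```

pie() and barplot() take the counts from table(), while hist() and boxplot() take the raw numeric values.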