Programming in R
Describing Univariate and Multivariate Data

Describing univariate data

In this session I will explain:
• Measures of central tendency and variation
• How to use figures to summarize a single variable (univariate data)
• How to create these in R

Characteristics of numeric variables
• Center: where do we find most of the data?
• Distribution or shape, such as a bell-shaped curve
• Variation or dispersion: how spread out are the data? On average, how far are observations from the center?
• Outliers: have we got Bill Gates in our salary sample?

Measures of central tendency
The “center” of a data set can be described using two different measures:
1. Mean: the commonly known “average”
2. Median: the midpoint

The mean
• The sample mean is sometimes called “x bar”: $\bar{x} = \frac{\sum x}{n}$
• Translation: add up all the values and divide by the number of values
• Usually, this is what people call the average

The median
• The middle of the data is called the median
– Sort the data from smallest to largest
– If there is an odd number of observations, the middle number is the median
– For an even number of observations, the median is the midpoint between the two middle numbers
• Example: median price = (7521 + 8139)/2 = 7830

The mode
• The most commonly occurring value
– There can be more than one mode (bimodal, multimodal)
– Sometimes there is no mode
– For categorical variables with no natural order, the mode is the only possible measure of central tendency
• The median and the mode for the variable table are both 62, while the mean is 61.
• The variable table appears fairly symmetric, with a slight left skew

Shape and skewness

Normal variables and standard deviation
• In a symmetric, bell-shaped distribution, we can describe the entire distribution using only two numbers: the mean and the standard deviation
• The standard deviation is roughly the average distance of observations from their mean

Calculating the standard deviation
Standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Translation: find the difference between the mean and each value in the dataset, square each difference, add these up, divide by the total number of values minus 1, then take the square root of that (or get R to do it for you)

And we care because?

The Empirical Rule
For any normal curve, approximately
• 68% of the values fall within 1 standard deviation of the mean
• 95% of the values fall within 2 standard deviations of the mean
• 99.7% of the values fall within 3 standard deviations of the mean

Other things to describe
• How many modes?
• The range, minimum and maximum

[Figure: histogram of eruption times for the Old Faithful geyser in Yellowstone National Park, 1997, n=107. x-axis: time of eruption in minutes; y-axis: # of eruptions]
This histogram shows a bimodal shape. The data have a minimum of 1.67 minutes and a maximum of 4.93 minutes, for a range of 3.26 minutes.
http://wps.aw.com/wps/media/objects/15/15719/projects/ch3_faithful/index.html

The five number summary
• Minimum, lower quartile, median, upper quartile and maximum
• The visual representation of the five number summary is the box plot, or box-and-whiskers plot
[Figure: box plot labeled Minimum, Lower Quartile, Median, Upper Quartile, Maximum]

Interpreting box plots
• ¼ of students slept between 3 and 6 hours, ¼ slept between 6 and 7, ¼ slept between 7 and 8, and ¼ slept between 8 and 16
• Outlier: any value more than 1.5 interquartile ranges (IQR) beyond the closest quartile, shown with stars
Copyright ©2005 Brooks/Cole, a division of Thomson Learning, Inc.
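
As a minimal sketch of the standard deviation steps, the Empirical Rule, the five number summary and the 1.5 IQR outlier rule described above, the R code below works through each on a small vector. The vector sleep is hypothetical, made up for illustration; it is not the data behind the box plot figure.

```r
# Hypothetical hours of sleep for 12 students (made-up values, not the slide data)
sleep <- c(3, 5, 6, 6, 6.5, 7, 7, 7.5, 8, 8, 9, 16)

# Standard deviation by hand, following the "translation" steps
n <- length(sleep)
deviations <- sleep - mean(sleep)               # difference between each value and the mean
sd_by_hand <- sqrt(sum(deviations^2) / (n - 1)) # square, add up, divide by n - 1, square root

# R's built-in sd() uses the same n - 1 formula
sd_builtin <- sd(sleep)

# Empirical Rule: share of a normal curve within 1, 2 and 3 standard deviations
k <- c(1, 2, 3)
within <- pnorm(k) - pnorm(-k)                  # approximately 0.683, 0.954, 0.997

# Five number summary: minimum, lower quartile, median, upper quartile, maximum
five <- quantile(sleep, probs = c(0, 0.25, 0.5, 0.75, 1))

# Outliers: values more than 1.5 IQR beyond the closest quartile
q1 <- five[["25%"]]
q3 <- five[["75%"]]
iqr <- q3 - q1
outliers <- sleep[sleep < q1 - 1.5 * iqr | sleep > q3 + 1.5 * iqr]

print(c(by_hand = sd_by_hand, builtin = sd_builtin))
print(round(within, 3))
print(five)
print(outliers)
```

For this made-up vector the quartiles are 6 and 8, so the IQR is 2 and only the value 16 lies beyond the upper fence of 11. Note that quantile() supports several interpolation methods, so quartiles from other tools (or from the hinges that boxplot() draws) can differ slightly on other data.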
Other ways to visualize data
• When developing a visual representation of a single variable, the most common tools are
– Histograms, pie charts, bar charts, box plots and stem-and-leaf plots
• We’ve already seen a histogram and a box plot

Pie charts
• Excellent for categorical variables with 5 or fewer categories
[Figure: pie chart, “Diamond Sales in Millions for 30 NYC Jewelers in 2001”. Legend: 1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr; slice labels: 1.2, 9%; 1.4, 10%; 3.2, 23%; 8.2, 58%]

Bar charts
• Can be used to illustrate categories, or means and medians by categories
[Figure: bar chart, “Clarity of 30 diamonds sold by an Atlanta Jeweler, 2011”. x-axis: clarity ranking (Two, Three, Four, Five); y-axis: number of diamonds; bar labels: 11, 8, 6, 5]

How to produce these in R
• summary() to get the minimum, first quartile, median, mean, third quartile and maximum
• table() to get frequency counts
• prop.table() to turn those counts into proportions (multiply by 100 for percentages)
• Plus pie(), barplot(), hist() and boxplot() to get pie charts, bar charts, histograms and box plots, respectively
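
The functions listed above can be tried on a small example. This is only a sketch: the vectors price and clarity are hypothetical, made up for illustration (the two middle prices are borrowed from the median example earlier), and the plot titles and axis labels are placeholders.

```r
# Hypothetical data for six diamonds (made-up values, not the data in the slides)
price   <- c(5200, 6100, 7521, 8139, 9400, 12500)
clarity <- c("Two", "Three", "Two", "Four", "Five", "Two")

# Numeric variable: minimum, quartiles, median, mean and maximum in one call
print(summary(price))

# Categorical variable: frequency counts, proportions and percentages
counts   <- table(clarity)
shares   <- prop.table(counts)       # proportions that sum to 1
percents <- round(100 * shares, 1)   # percentages
print(counts)
print(percents)

# Base R has no function for the statistical mode (mode() reports the
# storage type), so take the most frequent category from the table
modes <- names(counts)[counts == max(counts)]
print(modes)

# Figures for a single variable
pie(counts, main = "Share of diamonds by clarity")
barplot(counts, xlab = "Clarity ranking", ylab = "Number of diamonds")
hist(price, main = "Diamond prices", xlab = "Price")
boxplot(price, horizontal = TRUE, xlab = "Price")
```

With an even number of prices, the median reported by summary() is the midpoint of the two middle values, (7521 + 8139)/2 = 7830. When run as a script rather than interactively, R usually writes the plots to a file instead of opening a window.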