Describing data with graphics and numbers

Types of Data
• Categorical variables – also known as class variables or nominal variables
• Quantitative variables – also known as numerical variables; either continuous or discrete

Graphing categorical variables
Ten most common causes of death in Americans between 15 and 19 years old in 1999.
Bar graphs

Graphing numerical variables
Heights of BIOL 300 students (cm)
165 170 142 173 168 168 155 160 165 165 163 152 154 165
180 173 190 165 175 165 170 170 156 155 163 168 177 166

Stem-and-leaf plot
19 | 0
18 | 0
17 | 0003357
16 | 0335555556888
15 | 24556
14 | 2

Frequency table
Height group (cm)   Frequency
141-150             1
151-160             6
161-170             15
171-180             5
181-190             1

Histogram

Frequency distribution

Histogram with more data

Cumulative Frequency Distribution
[Figure: cumulative frequency distribution; x-axis: Height (in cm) of Bio300 Students; y-axis: Cumulative Frequency; the 50th percentile (median) and the 90th percentile are marked]

Associations between two categorical variables

Association between reproductive effort and avian malaria
Table 2.3A. Contingency table showing incidence of malaria in female great tits subjected to experimental egg removal.

               Control group   Egg removal group   Row total
Malaria        7               15                  22
No malaria     28              15                  43
Column total   35              30                  65

Mosaic plot
[Figure: mosaic plot; x-axis: Treatment (Control, Egg removal); y-axis: Relative frequency]
Figure 2.3B.
Mosaic plot for reproductive effort and avian malaria in great tits (Table 2.3A). Blue fill indicates diseased birds, whereas white fill indicates birds free of malaria. n = 65 birds.

Grouped Bar Graph
[Figure: grouped bar graph of the number of Malaria and No malaria birds in the Control and Egg removal groups]

Associations between categorical and numerical variables
Multiple histograms
[Figure: two histograms of Protein length, one panel for Non-conserved and one for Conserved]

Associations between two numerical variables
Scatterplots

Evaluating Graphics
• Lie factor
• Chartjunk
• Efficiency

Don’t mislead with graphics
Better representation of truth

Lie Factor
• $\text{Lie factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}$

Lie Factor Example
Effect in graphic: 2.33/0.08 = 29.1
Effect in data: 6748/5844 = 1.15
Lie factor = 29.1 / 1.15 = 25.3

Chartjunk
[Figure: example chart with series North, West, and East across 1st Qtr to 4th Qtr]

Needless 3D Graphics
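The Lie Factor Example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `lie_factor` and its parameter names are hypothetical, chosen only for illustration, and the four numbers are the ones quoted on the slide.

```python
def lie_factor(graphic_large, graphic_small, data_large, data_small):
    """Ratio of the effect drawn in the graphic to the effect in the data.

    All names here are illustrative; the slides define only the ratio itself.
    """
    effect_in_graphic = graphic_large / graphic_small
    effect_in_data = data_large / data_small
    return effect_in_graphic / effect_in_data


# Numbers quoted in the Lie Factor Example.
example = lie_factor(2.33, 0.08, 6748, 5844)
print(f"Lie factor: {example:.2f}")

# A graphic that shows exactly the effect in the data has a lie factor of 1.
honest = lie_factor(2, 1, 2, 1)
print(f"Honest graphic: {honest:.1f}")
```

Carrying full precision gives a value of about 25.2; the 25.3 in the example comes from rounding the two effects to 29.1 and 1.15 before dividing.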
Summary: Graphical methods for frequency distributions
Type of data       Method
Categorical data   Bar graph
Numerical data     Histogram; cumulative frequency distribution

Summary: Associations between variables
Explanatory variable   Response variable   Method
Categorical            Categorical         Contingency table; grouped bar graph; mosaic plot
Categorical            Numerical           Multiple histograms; cumulative frequency distributions
Numerical              Numerical           Scatter plot

Great book on graphics
[Figure: image not recovered]

Describing data
Two common descriptions of data
• Location (or central tendency)
• Width (or spread)

Measures of location
• Mean
• Median
• Mode

Mean
$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$
n is the size of the sample

Mean
Y1=56, Y2=72, Y3=18, Y4=42
$\bar{Y}$ = (56+72+18+42) / 4 = 47

Median
• The median is the middle measurement in a set of ordered data.
The data: 18 28 24 25 36 14 34
can be put in order: 14 18 24 25 28 34 36
Median is 25.

[Figure: histogram of mouse weight at 50 days old, in a line selected for small size; y-axis: Frequency; the median, mode, and mean are marked]

Mean vs. median in politics
• 2004 U.S. Economy
• Republicans: times are good
  – Mean income increasing ~ 4% per year
• Democrats: times are bad
  – Median family income fell
• Why?
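The mean and median examples above can be checked with Python's standard `statistics` module. The sketch below also adds one deliberately extreme value to the median data to show how differently the two measures react. That extra value (500) is a hypothetical illustration, not data from the slides, and the sketch assumes the point of the "Why?" question is the mean's sensitivity to extreme values.

```python
from statistics import mean, median

# Data from the Mean and Median slides.
sample_for_mean = [56, 72, 18, 42]
sample_for_median = [18, 28, 24, 25, 36, 14, 34]

mean_value = mean(sample_for_mean)
median_value = median(sample_for_median)
print("Mean of the first sample:", mean_value)
print("Median of the second sample:", median_value)

# HYPOTHETICAL: append one very large value (not from the slides)
# and compare how far the mean and the median move.
with_outlier = sample_for_median + [500]

mean_before = mean(sample_for_median)
mean_after = mean(with_outlier)
median_after = median(with_outlier)

print(f"Mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"Median: {median_value} -> {median_after}")
```

In this sketch a single large value more than triples the mean while the median barely changes, so the two measures can tell different stories about the same data.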
Mean 169.3 cm
Median 170 cm
Mode 165-170 cm
[Figure: cumulative frequency distribution; x-axis: Height (in cm) of Bio300 Students; y-axis: Cumulative Frequency]

Measures of width
• Range
• Standard deviation
• Variance
• Coefficient of variation

Range
14 17 18 20 22 22 24 25 26 28 28 28 30 34 36
The range is 36-14 = 22

Population variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$

Sample variance
$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$
n is the sample size

Shortcut for calculating sample variance
$s^2 = \frac{n}{n - 1} \left( \frac{\sum_{i=1}^{n} Y_i^2}{n} - \bar{Y}^2 \right)$

Standard deviation (SD)
• Positive square root of the variance
$\sigma$ is the true standard deviation
$s$ is the sample standard deviation

In-class exercise
Calculate the variance and standard deviation of a sample with the following data: 6, 1, 2

Answer
Variance = 7
Standard deviation = $\sqrt{7}$ ≈ 2.65

Coefficient of variation (CV)
$CV = 100 \, \frac{s}{\bar{Y}}$

Equal means, different variances
[Figure: frequency distributions with equal means and variances V = 1, V = 2, and V = 10; x-axis: Value; y-axis: Frequency]

Manipulating means
• The mean of the sum of two variables: E[X + Y] = E[X] + E[Y]
• The mean of the sum of a variable and a constant: E[X + c] = E[X] + c
• The mean of a product of a variable and a constant: E[c X] = c E[X]
• The mean of a product of two variables: E[X Y] = E[X] E[Y] if X and Y are independent (more generally, whenever their covariance is zero).

Manipulating variance
• The variance of the sum of two variables: Var[X + Y] = Var[X] + Var[Y] if X and Y are independent (more generally, whenever their covariance is zero).
• The variance of the sum of a variable and a constant: Var[X + c] = Var[X]
• The variance of a product of a variable and a constant: Var[c X] = c² Var[X]

Parents’ heights
                                Mean    Variance
Father Height                   174.3   71.7
Mother Height                   160.4   58.3
Father Height + Mother Height   334.7   184.9
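The in-class exercise and the Parents' heights table can both be worked through in a short Python sketch. The function names are hypothetical. The last step uses the standard identity Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y], which is not stated on the slides, to back out the covariance implied by the table.

```python
from math import isclose, sqrt


def sample_variance(values):
    """Sample variance from the definition: squared deviations over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def sample_variance_shortcut(values):
    """Sample variance from the shortcut: n/(n-1) * (mean of squares - mean^2)."""
    n = len(values)
    mean = sum(values) / n
    return (n / (n - 1)) * (sum(y * y for y in values) / n - mean ** 2)


# In-class exercise data.
exercise = [6, 1, 2]
s2 = sample_variance(exercise)
s = sqrt(s2)
cv = 100 * s / (sum(exercise) / len(exercise))
print(f"variance = {s2}, SD = {s:.2f}, CV = {cv:.1f}%")

# The definition and the shortcut should agree up to rounding error.
assert isclose(s2, sample_variance_shortcut(exercise))

# Parents' heights: means and variances as listed in the table.
mean_father, mean_mother, mean_sum = 174.3, 160.4, 334.7
var_father, var_mother, var_sum = 71.7, 58.3, 184.9

# Means add whether or not the variables are independent.
assert isclose(mean_father + mean_mother, mean_sum)

# Variances add only when the covariance is zero; otherwise
# Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y] (identity assumed, not from the slides).
implied_covariance = (var_sum - var_father - var_mother) / 2
print(f"implied covariance = {implied_covariance:.2f}")
```

Because 71.7 + 58.3 = 130.0 falls well short of 184.9, the table suggests that fathers' and mothers' heights are positively associated rather than independent, so the simple variance-of-a-sum rule does not apply to them.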