Statistics 100 Lecture Set 6 Re-cap • Last day, looked at a variety of plots • For categorical variables, most useful plots were bar charts and pie charts • Looked at time plots for quantitative variables • Key thing is to be able to quickly make a point using graphical techniques Re-cap • Recall: – A distribution of a variable tells us what values it takes on and how often it takes these values. Histograms • Similar to a bar chart, would like to display main features of an empirical distribution (or data set) • Histogram – Essentially a bar chart of values of data – Usually grouped to reduce “jitteriness” of picture – Groups are sometimes called “bins” Histogram 8 • Uses rectangles to show number (or percentage) of values in intervals 4 • X-Axis usually shows intervals Counts 6 • Y-axis usually displays counts or percentages 0 2 • Rectangles are all the same width 6 8 10 12 Data 14 16 Example (discrete data) • In a study of productivity, a large number of authors were classified according to the number of articles they published during a particular period of time. Number of articles 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Frequency 784 204 127 50 33 28 19 19 6 7 6 7 4 4 5 3 3 00 unt 600 800 Example (discrete data) Example (continuous data) • Experiment was conducted to investigate the muzzle velocity of a anti-personnel weapon (King, 1992) • Sample of size 16 was taken and the muzzle velocity (MPH) recorded 279.9 294.1 260.7 277.9 285.5 298.6 265.1 305.3 264.9 262.9 256.0 255.3 276.3 278.2 283.6 250.0 Constructing a Histogram – continuous data • Find minimum and maximum values of the data • Divide range of data into non-overlapping intervals of equal length • Count number of observations in each interval • Height of rectangle is number (or percentage) of observations falling in the interval • How many categories? Example • Experiment was conducted to investigate the muzzle velocity of a anti-personnel weapon (King, 1992) • Sample of size 16 was taken and the muzzle velocity (MPH) recorded 279.9 294.1 260.7 277.9 285.5 298.6 265.1 305.3 264.9 262.9 256.0 255.3 276.3 278.2 283.6 250.0 • What are the minimum and maximum values? • How do we divide up the range of data? • What happens if have too many intervals? • Too Few intervals? • Suppose have intervals from 240-250 and 250-260. In which interval is the data point 250 included? Intervals [240,250) [250,260) [260,270) [270,280) [280,290) [290,300) [300,310) Frequency 2 1 0 Frequency 3 4 Histogram of Muzzle Velocity 240 260 280 Muzzle Velocity (MPH) 300 320 Interpreting histograms • Gives an idea of: – Location of centre of the distribution – How spread are the data – Shape of the distribution • Symmetric • Skewed left • Skewed right • Unimodal • Bimodal • Multimodal – Outliers • Striking deviations from the overall pattern Example – mid-term 1 grades (2011) • Was out of 34 + a bonus question (n=344) Example – mid-term 1 grades (2011) • Too many bins? Example – mid-term 1 grades (2011) • Too few bins Example – mid-term 1 grades (2011) • Potential outlier? G r ades f r om a St a t s Cla s s 9 0 1 0 0 8 6 4 2 0 Co u n t 1 0 1 2 1 4 Ex a m 6 0 7 0 8 0 E x a m 1 G r a d e s Example G r ades f or Un d e r g r a d u a t e s 6 4 2 0 Co u n t 8 1 0 Ex a m 6 0 7 0 8 0 Ex a m 1 9 0 G r a d e s 1 0 0 Example G r ades f or G r a d u a t e St u d e n t s 4 2 0 Co u n t 6 8 Ex a m 6 0 7 0 8 0 Ex a m 1 9 0 G r a d e s 1 0 0 Numerical Summaries (Chapter 12) • Graphic procedures visually describe data • Numerical summaries can quickly capture main features Measures of Center • Have sample of size n from some population, x , x ,..., x 1 2 • An important feature of a sample is its central value • Most common measures of center - Mean & Median n Sample Mean • The sample mean is the average of a set of measurements • The sample mean: n x x i1n i Sample Median • Have a set of n measurements, x , x ,..., x 1 2 n • Median (M) is point in the data that divides the data in half • Viewed as the mid-point of the data • To compute the median: 1. For sample size “n”, compute position = (n+1)/2 1. 2. If position is a whole number, then M is the value at this position of the sorted data If position falls between two numbers, then M is the value halfway between those two positions in the sorted data Example • Finding the Median, M, when n is odd • Example: Data = 7, 19, 4, 23, 18 Muzzle Velocity Example • Data (n=16) 279.9 294.1 260.7 277.9 285.5 298.6 265.1 305.3 264.9 262.9 256.0 255.3 276.3 278.2 283.6 250.0 Muzzle Velocity Example • Mean: 279.9 294.1 ... 250.0 x 274.6 16 Muzzle Velocity Example • Median: Sample Mean vs. Sample Median • Sometimes sample median is better measure of center • Sample median less sensitive to unusually large or small values called • For symmetric distributions the relative location of the sample mean and median is • For skewed distributions the relative locations are Other Measures of interest • Maximum • Minimum • Percentiles – A percentile of a distribution is a value that cuts off the stated part of the distribution at or below that value, with the rest at or above that value. • 5th percentile: 5% of distribution is at or below this value and 95% is at or above this value. • 25th percentile: 25% at or below, 75% at or above • 50th percentile: 50% at or below, 50% at or above • 75th percentile: 75% at or below, 25% at or above • 90th percentile: 90% at or below, 10% at or above • 99th percentile: ___% at or below, ___% at or above • Percentiles – Can be applied to a population or to a sample • Usually don’t know population – Use sample percentiles to estimate pop. percentiles – Standardized tests often measured in percentiles – Birth statistics often measured in percentiles • First – – – daughter 10th percentile weight 25th percentile length 95th percentile head circumference Important Percentiles • First Quartile • Second Quartile • Third Quartile Computing the quartiles • You know how to compute the median • Q1 = • Q3 = Example • Finding the other quartiles 1. 2. • Example: 4, 7, 18, 19, 23, – – • For Q1, find the median of all values below M. For Q3, find the median of all values above M. Q1: Q3: Example: 4, 7, 12, 18, 19, 23, – – M=18 Q1: Q3: M=15 • 5 number summary often reported: – Min, Q1, Q2 (Median), Q3, and Max – Summarizes both center and spread – What proportion of data lie between Q1 and Q3? Box-Plot Displays 5-number summary graphically • Box drawn spanning quartiles • Line drawn in box for median • Lines extend from box to max. and min values. • Some programs draw whiskers only to 1.5*IQR above and below the quartiles 75 70 65 60 Ex am Sc ores 80 85 • Undergrads • Can compare distributions using side-by-side box-plots • What can you see from the plot? Example - Moisture Uptake • There is a need to understand degradation of 3013 containers during long term storage • Moisture uptake is considered a key factor in degradation due to corrosion • Calcination removes moisture • Calcination temperature requirements were written with very pure materials in mind, but the situation has evolved to include less pure materials, e.g. high in salts (Cl salts of particular concern) • Calcination temperature may need to be reduced to accommodate salts. • An experiment is to be conducted to see how the calcination temperature impacts the mean moisture uptake Working Example - Moisture Uptake • Experiment Procedure: – Two calcination temperatures…wish to compare the mean uptake for each temperature – Have 10 measurements per temperature treatment – The temperature treatments are randomly assigned to canisters • Response: Rate of change in moisture uptake in a 48 hour period (maximum time to complete packaging) Box-Plot Other Common Measure of Spread: Sample Variance • Sample variance of n observations: n s 2 (x x) i 1 i n 1 • • Units are in squared units of data 2 Sample Standard Deviation • Sample standard deviation of n observations: n s • • Has same units as data (x x) i 1 i n 1 2 Exercise • Compute the sample standard deviation and variance for the Muzzle Velocity Example Comments • Variance and standard deviation are most useful when measure of center is • As observations become more spread out, s : increases or decreases? • Both measures sensitive to outliers • 5 number summary is better than the mean and standard deviation for describing (i) skewed distributions; (ii) distributions with outliers Comments • Standard deviation is zero when • Measures spread relative to More interpretation of s • Empirical Rule – If a distribution is bell-shaped and roughly symmetric, then • About 2/3 of data will lie within ±1s of x • About 95% of data will lie within ±2s of x • Usually all data will lie within ±3s of x – So you can reconstruct a rough picture of the histogram from just two numbers Example • Mid-term: Example • Mid-term: • Empirical rule tells us