Chapter 1 - Exploring Data AP Statistics Amanda Carter Borrowed from Brian Hamilton & Jason Molesky Chapter 1 Section 1 Displaying Distributions with Graphs HW: 1.1(a, b), 1.2, 1.3, 1.4, 1. 5, 1.6 1.8, 1.9, 1.10, 1.11 1.13, 1.14, 1.16 Exploratory Data Analysis o Exploratory Data Analysis: Statistical practice of analyzing distributions of data through graphical displays and numerical summaries. o Distribution: Description of the values a variable takes on and how often the variable takes on those values. o An EDA allows us to identify patterns and departures from patterns in distributions. Categorical Data • Categorical Variable: Values are labels or categories. Distributions list the categories and either the count or percent of individuals in each. • Displays: BarGraphs & Pie chart(not done in AP) Bar Graph Quantitative Data • Quantitative Variable: • Values are numeric - arithmetic computation makes sense (average, etc.) • Distributions list the values and # of times the variable takes on that value. • Displays: • • • • Dotplots Stemplots Histograms Boxplots Only organized Data can Illuminate! Your goal is to make neat, organized, labeled graphs that display the distribution of data effectively and provide an insight into patterns and departures from patterns. Dotplots • Small datasets with a small range (max-min) can be easily displayed using a dotplot. Draw and label a number line from min to max. Place one dot per observation above its value. Stack multiple observations evenly. 34 values ranging from 0 to8. Stemplots • A stemplot gives a quick picture of the shape of a distribution while including the numerical values. • Separate each observation into a stem and a leaf. • eg. 14g -> 1|4 256 -> 25|6 32.9oz -> 32|9 • Write stems in a vertical column and draw a vertical line to the right of the column. • Write each leaf to the right of its stem. Stemplots • Example1.4, pages 42-43 • Literacy Rates in Islamic Nations Stemplots • Note: Stemplots do not work well for large data sets • Back-to-Back Stemplots: Compare datasets • Splitting Stems: Double the number of stems, writing 0-4 after the first and 5-9 after second. Histograms • Histograms break the range of data values into classes and displays the count/% of observations that fall into that class. Divide the range of data into equal-width classes. Count the observations in each class “frequency” Draw bars to represent classes - height = frequency Bars should touch (unlike bar graphs). Histograms • Let’s look ex. 1.6, page 49 on 5th grade IQ scores Histograms TAKE THE DATA AND MAKE CLASSES WITH FREQUENCY • Describe the SOCS or SECS • What do these data suggest? EDA Summary • The purpose of an Exploratory Data Analysis is to organize data and identify patterns/departures. PLOT YOUR DATA - Choose an appropriate graph Look for overall pattern and departures from pattern Shape {mound, bimodal, skewed, uniform} Outliers {points clearly away from body of data} Center {What number “typifies” the data?} Spread {How “variable” are the data values?} Let’s Look/Investigate Draw different “shaped” histograms Cumulative Relative Frequency Graphs (OGIVE) Histograms do a good job of displaying the distribution of values or a quantitative variable, but tell us little about the relative standing of an individual observation. If we want relative standing information, we use an ogive. Cumulative Relative Frequency Graphs (OGIVE) Cumulative Relative Frequency Graphs • Decide on class intervals and determine the frequency in each class (like with a histogram), but add some additional columns to your table. • Label and scale your axes and title your graph. • Plot a point corresponding to the relative cumulative frequency in each class interval at the left endpoint of the next class interval. Using Ogives Bill Clinton took office at the age of 46. Find his relative standing for the age he took office. About 10% of all U.S. Presidents were the same age as or younger than Bill Clinton when they were inaugurated. Average Price of Gasoline from 1996-2006 Chapter 1 Section 2 Describing Distributions with Numbers HW: 1.27, 1.29-1.31, 1.33, 1.36, 1.37, 1.41, 1.43, 1.45, 1.46 Describing Distributions with Numbers What three things do we always include when describing a distribution of numbers? • Shape • Center • Spread Describing Distributions with Numbers What three things do we always include when describing a distribution of numbers? • Shape- use a histogram • Center • Spread Describing Distributions with Numbers What three things do we always include when describing a distribution of numbers? • Shape- use a histogram • Center• Spread- Describing Center Find the mean for the Fuel Economy for 2004 model motor vehicles You can use a graphing calculator How Do I write this? • Use your notation and begin the calculation (this tells me you understand the formula) and it will allow for partial credit versus no credit…. • (show example) Means & Outliers Are there any values that appear to be outliers in our fuel economy data? If so, find the mean without the outlier. The single outlier adds 2 mpg onto the mean highway mileage of 2004 motor vehicles. The mean is sensitive to the influence of a few extreme observations. It is not a resistant measure of center. Finding the median • Median highway mileage for two-seaters 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 6 Without the Honda insight? 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 29 32 • The median is a more resistant measure of center than the mean. Mean verses Median How do you think the mean and median compare in a symmetric distribution? In a right skewed distribution? Don't confuse the “average” value (mean) with the “typical” value (median). Mean vs. Median The mean and the median are the most common measures of center. If a distribution is perfectly symmetric, the mean and the median are the same. The mean is not resistant to outliers. You must decide which number is the most appropriate description of the center... MeanMedian Applet Mean verses Median Generally, when you encounter outliers, correct them if wrongly recorded, delete them for good reason, or otherwise give them individual attention. If correctly recorded outliers cannot be discarded, consider using resistant measures. Measuring Spread Quartiles Quartiles Q1 and Q3 represent the 25th and 75th percentiles. To find them, order data from min to max. Determine the median - average if necessary. The first quartile is the middle of the ‘bottom half’. The third quartile is the middle of the ‘top half’. 19 22 23 23 23 Q1=23 45 68 74 Q1 26 26 27 28 med 75 76 med=79 29 30 31 32 Q3=29.5 82 82 91 Q3 93 98 Finding the Quartiles Find the Quartiles, and IQR. 13 15 16 16 17 19 20 22 23 23 23 24 25 25 26 28 28 28 2 3 Computer Software Output 5-Number Summary, Boxplots We can visualize the 5 Number Summary with a boxplot. min=45 45 50 Q1=74 55 60 med=79 65 70 Q3=91 75 80 Quiz Scores 85 max=98 90 95 100 Measures of Spread Variability is the key to Statistics. Without variability, there would be no need for the subject. When describing data, never rely on center alone. Measures of Spread: Range - {rarely used...why?} Quartiles - InterQuartile Range {IQR=Q3Q1} Variance and Standard Deviation {var and sx} Like Measures of Center, you must choose the most appropriate measure of What about OUTLIERS? THIS HELPS SETS UP “FENCES” TO SEE IF THERE ARE ANY DATA THAT FALLS OUTSIDE..IF SO THEY ARE OUTLIERS 1.5 • IQR Rule To determine outliers: Find 5 Number Summary Determine IQR Multiply 1.5xIQR Set up “fences” Q1-(1.5IQR) and Q3+(1.5IQR) Observations “outside” the fences are outliers. Outlier Example All data on p. 48. IQR=45.72-19.06 IQR=26.66 fence: 19.0639.99 = -20.93 fence: 45.72+39.99 = 85.71 outliers } { 0 1.5IQR=1.5(26.66) 1.5IQR=39.99 10 20 30 40 50 60 70 Spending ($) 80 90 100 Let’s CREATE a Box-Plot What do we need? Standard Deviation Another common measure of spread is the Standard Deviation: a measure of the “average” deviation of all observations from the mean. To calculate Standard Deviation: Calculate the mean. Determine each observation’s deviation (x - xbar). “Average” the squared-deviations by dividing the total squared deviation by (n1). This quantity is the Variance. Standard Deviation (x1 x ) 2 (x2 x ) 2 ... (xn x ) 2 Variance:var n 1 sx Standard Deviation: 2 (x x ) i n 1 Example 1.16 (p.85): Metabolic Rates 1792 1666 1362 1614 1460 1867 1439 Standard Deviation 1792 1666 1362 1614 1460 1867 1439 Metabolic Rates: mean=1600 x (x - x) (x - x)2 1792 192 36864 1666 66 4356 1362 -238 56644 1614 14 196 1460 -140 19600 1867 267 71289 1439 -161 25921 Totals: 0 214870 Total Squared Deviation 214870 Variance var=214870/ 6 var=35811.6 6 Standard s=√35811.66 Deviation s=189.24 cal What does this value, s, mean? Linear Transformations Variables can be measured in different units (feet vs meters, pounds vs kilograms, etc) When converting units, the measures of center and spread will change. Linear Transformations (xnew=a+bx) do not change the shape of a distribution. Multiplying each observation by b multiplies both the measure of center and spread by b. Adding a to each observation adds a to the measure of center, but does not affect spread. Data Analysis Toolbox •To answer a statistical question of interest: Data: Organize and Examine Who are the individuals described? What are the variables? Why were the data gathered? When,Where,How,By Whom were data gathered? Graph: Construct an appropriate graphical display Describe SOCS Numerical Summary: Calculate appropriate center and spread (mean and s or 5 number summary) Interpretation: Answer question in context! Chapter 1 Summary Data Analysis is the art of describing data in context using graphs and numerical summaries. The purpose is to describe the most important features of a dataset.