CHAPTER 2 - STATISTICS OF ONE VARIABLE
Section 2.1: Data Analysis With Graphs - p. 91
MDM 4U1

KEY CONCEPTS & DEFINITIONS / DETAILS / EXAMPLES

Raw data - unprocessed information collected for a study
    Data can be collected using surveys, polls, etc.
    Ex.1 The number of hours of TV watched by MDM students in a week.
    Ex.2 The shoe size of each girl in the MDM class.

Variable - the quantity being measured
    In Ex.1, the variable is the number of hours of TV watched. In Ex.2 it is ______________________.

Continuous variable - a variable that can have any value in a RANGE
    Continuous variables must be numerical.
    Ex.3 Height, weight, class marks, and hours of TV watched are continuous variables.

Discrete variable - a variable that can only have specific & separate values
    Discrete variables can be numerical or categorical.
    Ex.4 Shoe size, hair colour, provinces, and days of the week are discrete variables.

*Categorical data - data that is discrete and not numerical
    Categorical data are given labels.
    Ex.5 If hair colour is the variable, the data are put into categories with labels, such as BROWN, BLOND, BLACK, RED, etc.

Frequency table - a table that is used to organize raw data to view the FREQUENCY of the values
    Frequency tables are useful for summarizing and analyzing data.
    Ex.6 A frequency table for shoe size from Ex.2.

Frequency diagram - a graph of the data in a frequency table

*Histogram - a special type of bar graph in which the bars are connected and represent a continuous range of values (a regular bar graph has bars that are separated, indicating separate categories)
    For a large amount of continuous data, the data is usually grouped into classes or intervals, which makes the graphs easier to construct and interpret. Typically, between 5 and 20 intervals are used, and together they must cover the range of the data. To find the range, subtract the smallest value in the list from the largest.
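The grouping step described above can be sketched in a few lines of Python. This is only an illustration: the data values, the number of intervals, the interval width, and the starting edge are all made up for the example and are not taken from the course notes.

```python
# Hypothetical raw data: hours of TV watched in a week (made-up values).
hours = [3, 12, 7, 0, 15, 9, 22, 5, 11, 18, 6, 14, 8, 10, 2]

# Range = largest value minus smallest value.
data_range = max(hours) - min(hours)

num_intervals = 5   # assumed; the notes suggest between 5 and 20
width = 5           # assumed width, chosen so 5 intervals cover the range
start = 0           # assumed lower edge of the first interval

# Count how many values fall in each interval [low, high).
frequency_table = []
for i in range(num_intervals):
    low = start + i * width
    high = low + width
    count = sum(1 for h in hours if low <= h < high)
    frequency_table.append((low, high, count))

print("range =", data_range)
for low, high, count in frequency_table:
    print(f"{low:>2} to <{high:<2}  frequency = {count}")
```

Each value is counted in exactly one interval because the upper edge of an interval is excluded; a histogram would then draw one connected bar per row of this table.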
A frequency diagram could be a histogram, a frequency polygon (line graph), a bar graph, a pie graph, or a pictograph. The first two types of graphs are used most often for continuous variables, while the last three are used most often for discrete variables. Diagrams are useful for displaying and analyzing data.
    Ex.7 A pie graph for shoe size from Ex.2.
    Ex.8 A histogram for hours of TV watched from Ex.1.

Cumulative frequency graph - a graph that shows the RUNNING TOTAL of the frequencies from the lowest value up (also called an ogive)
    Cumulative frequency graphs are good for answering questions about the data that involve proportion.
    Ex.9 A cumulative frequency polygon for the number of hours of TV watched. What percentage of the students watch ____ hours of TV or less?

Relative-frequency graph - a graph that shows the frequency of a data group as a fraction or percent of the whole data set
    Instead of the y-axis reading “frequency”, it will now read “relative frequency” and be expressed in percent.
    Ex.10 A relative-frequency graph for shoe size (bar graph).

HMK - p. 101 #1 (sol’ns wrong in txt), 2 (2c is vague), 3ab, 4ab, 5 (sol’n in txt has too many intervals), 7, 9 (error in txt with endpoints), 11, 13, 15

Section 2.3 - Sampling Techniques - p.113

When a STUDY or a SURVEY is done, it is often impossible/difficult to question EVERYONE it concerns. Most often, researchers use a PERCENTAGE of the people concerned, called a SAMPLE.

POPULATION: all individuals/items that belong to a group being studied.
SAMPLE: a group of people/items SELECTED FROM a population.

*A sample must be chosen FAIRLY, as it is meant to REPRESENT the entire population.

ex. Population - all students at Villy
    Sample - our class
    This would be a bad sample because it’s only grade 12 students, academic, etc.

A sample can be chosen in a number of fair ways.
The choice of sampling technique depends on several factors (the nature of the population, cost, convenience, and reliability) and is important for an accurate reflection of the population.

SAMPLING TECHNIQUES:

1. SIMPLE RANDOM SAMPLE
When the population is made up of identifiable individuals who form one large group, this technique is appropriate. Each member might be assigned a different number, and then numbers can be randomly selected using a computer, out of a hat, etc.
ex. All men in Essex County over 50

2. SYSTEMATIC SAMPLE
The pop’n is still made up of identifiable individuals who form one large group, but the group may already be organized (ex. phone book, voters’ list, etc.). You select members of the group at regular, sequential intervals (this is still random because you have no idea who will be selected).
interval = population size / sample size
ex. interval = 500 800 / 300 ≈ 1670
    Choose ONE of the first 1670 members, and then every 1670th member after that.

3. STRATIFIED SAMPLE
When the pop’n is made up of DISTINCT GROUPS of members (the groups are called STRATA), this sampling technique is used. It is used so that the SAMPLE has the same PROPORTION of members from each stratum as the pop’n. You multiply the number of members in each stratum by a desired percent.
ex.
Salaries          # of members     sample (10% of the pop’n)
20000-40000       1200             0.10 x 1200 = 120
40001-60000        800             0.10 x 800 = 80
60001 or more      300             0.10 x 300 = 30
                  (2300 people)    (230 people)

4. CLUSTER SAMPLE
When the pop’n is made up of GROUPS, but those groups are all very similar (and likely to be representative of the entire pop’n), a number of groups are chosen randomly for the sample.
ex. groups of employees at Roots stores across Canada

5. MULTI-STAGE SAMPLE
This technique uses SEVERAL LEVELS of random sampling to narrow down the sample.
ex.
Population - all students in Ontario secondary schools
Sample - 1st, randomly select some municipalities
         2nd, randomly select some schools WITHIN those municipalities
         3rd, randomly select some students WITHIN those schools

6. VOLUNTARY-RESPONSE SAMPLE (not as fair or representative as other methods)
This technique invites all members of the pop’n to participate VOLUNTARILY.
ex. call-in show
    mail-in from a magazine
    email response
    survey posted on a bulletin board

7. CONVENIENCE SAMPLE (not as fair or representative as other methods)
Sample groups are chosen for convenience and may not be representative.
ex. Population - all students at Villi
    Sample - our class

HMK - p.117 #1-4,6,8,11

Section 2.4 - Bias in Surveys - p.119

BIAS: systematic error or undue weighting in a statistical study. (Any factor that favours certain outcomes or responses and hence skews the study results.)
Bias can occur from choosing the wrong sample and/or collecting data incorrectly. Bias is USUALLY unintentional, but is sometimes introduced PURPOSELY to skew results toward a more desirable outcome.

1. SAMPLING BIAS
This occurs when the sample does not reflect the characteristics of the population.
ex. To conduct a survey about when the next school dance should be, students in the library during all four periods are polled.

2. NON-RESPONSE BIAS
This occurs when people who are surveyed refuse to participate. Those who are more concerned may respond more readily, skewing the results.

3. RESPONSE BIAS
This occurs when respondents DELIBERATELY give false or misleading answers in a survey. This may happen because the wording of the questions angers, embarrasses, etc. respondents.

4. MEASUREMENT BIAS
This occurs when the method for collecting the data consistently affects the variable it is measuring (it under- or overestimates the variable, so the measurement no longer represents the population characteristic). This could happen in one of three main ways.
The METHOD OF COLLECTION can be a problem (who collects the data or how it is collected).
ex. Having a teacher survey the number of students who smoke on school property (students may lie to the teacher, so the measurement will be LOWERED and will no longer be accurate).

If data is collected using QUESTIONS, how they are worded is very important, or they can also produce measurement bias.

LEADING QUESTIONS will lead an individual to choose an answer they might not otherwise have chosen, thereby OVERESTIMATING the measurement and skewing results.
ex. What is your favourite colour? a) blue b) green c) red d) other __________
    Since it is easier to choose a, b, or c, the results will likely be skewed.

LOADED QUESTIONS contain wording or information intended to influence the respondent’s answer, thereby UNDER/OVERESTIMATING the measurement.
ex. Do you favour the new uniform policy which will ban flip flops? (Should read: Do you think the new uniform policy is fair?)

HMK - p.123 #1-6,8

Section 2.5 - Measures of Central Tendency - p.125

When data is collected, people often want to know the “average” of the data (often for comparison purposes). “Average” can be measured mathematically in three different ways, each with advantages/disadvantages.
___________________________________________________________________
1. MEAN: the sum of the values of a variable divided by the number of values.

POPULATION MEAN: $\mu = \dfrac{x_1 + x_2 + \dots + x_N}{N} = \dfrac{\sum x}{N}$
SAMPLE MEAN (often used to approximate the pop’n mean): $\bar{x} = \dfrac{x_1 + x_2 + \dots + x_n}{n} = \dfrac{\sum x}{n}$

$\mu$: mu (population mean)
$\bar{x}$: x-bar (sample mean)
$\sum$: sigma (the sum of)
N: the number of values in the population
n: the number of values in the sample

2. MEDIAN: the middle value of the data when it is ranked from highest to lowest. For an even number of data (and therefore 2 middle values), take the mean of the two middle values.

3. MODE: the value that occurs most frequently in a set of values. There may be no mode, one mode, or several modes.
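The three measures defined above can be checked quickly with Python's standard `statistics` module. The sketch below is only illustrative: the `hours` list is a made-up sample, not data from the course.

```python
from statistics import mean, median, multimode

# Hypothetical sample: hours of TV watched in a week by 8 students (made-up values).
hours = [4, 7, 7, 10, 12, 3, 7, 10]

x_bar = mean(hours)        # sum of the values divided by the number of values
middle = median(hours)     # even count, so the mean of the two middle values
modes = multimode(hours)   # list of the most frequent value(s); may hold several

print("mean =", x_bar)
print("median =", middle)
print("mode(s) =", modes)
```

Note that `multimode` returns every value when each one occurs only once, which corresponds to the "no mode" case in the notes.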
___________________________________________________________________
Some sets of data have OUTLIERS (values that are DISTANT from the MAJORITY of the data).

In a small sample:
• outliers can greatly affect the MEAN
• the MEDIAN is less affected by outliers
• the MODE may not exist or may be an outlier!
• mean, median, and mode may not agree

In a large sample (the more data, the better):
• outliers have less effect on the MEAN
• the MEDIAN is even more accurate
• the MODE is likely to be more accurate
• mean, median, and mode are likely to be close

*WEIGHTED MEAN
A weighted mean must be used when not all of the data are equally important.

Ex.1 Your marks on a quiz (with a weight of 4) vs. a test (with a weight of 8) are 65% and 80%.
We cannot calculate the mean as $\bar{x} = (65 + 80) / 2 = 72.5$, because that would count both marks as equally important. Instead, we use this formula:

$\bar{x}_w = \dfrac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$ or $\bar{x}_w = \dfrac{\sum w_n x_n}{\sum w_n}$

$w_n$: the weighting factor for the value $x_n$

$\bar{x}_w = \dfrac{4(65) + 8(80)}{4 + 8} = \dfrac{260 + 640}{12} = \dfrac{900}{12} = 75\%$ (counting 65 four times and 80 eight times)

*DATA GROUPED INTO INTERVALS
(The mode cannot be found because we don’t know how many times EACH value in the INTERVAL occurs; we just have a total for the interval.)

The MEAN can be approximated using the midpoint value ($m_n$) for each interval and the frequency ($f_n$) for that interval.

$\mu \approx \dfrac{\sum f_n m_n}{\sum f_n}$ (population) and $\bar{x} \approx \dfrac{\sum f_n m_n}{\sum f_n}$ (sample)

The MEDIAN can be approximated by taking the midpoint of the interval within which the median is found (the interval is located by analyzing cumulative frequencies).

HMK - p.132 #1-4,7,8,9 (9d should read bar graph, not histogram),11,14 (good communication question, answer to 14b in text is incomplete)

Section 2.6: Measures of Spread - p. 136

Recall: Measures of central tendency indicate the average or central values of a set of data.
NOW: Measures of spread indicate how closely a set of data clusters around its centre.
The measure of spread used will depend on whether the MEAN or the MEDIAN has been calculated. There is no measure of spread for the mode.

MEASURE OF SPREAD FOR THE MEAN:
The STANDARD DEVIATION and VARIANCE of a set of data show how the data cluster around the mean of the data. A Z-SCORE shows how FAR a datum (one data value) is from the mean, numerically, in terms of standard deviations.

STANDARD DEVIATION:
Deviation - the difference between a value and the mean.
ex.  70% - 65% =  5% deviation
     58% - 65% = -7% deviation (negative dev. b/c mark < mean)
    (mark) (mean)
The larger the deviations, the greater the SPREAD of the data.

Standard Deviation - the square root of the mean of the squares of the deviations.
*The “standard” deviation is like the “average” deviation for a data value.
*(see the table and calculations in Ex. 1 below)

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$ (Pop’n)      $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$ (Sample)
($\sigma$: sigma)

*there is greater weight on larger deviations due to the squaring
**A small standard deviation indicates that the data cluster CLOSELY around the mean, while a larger st. dev. indicates that the data are quite spread out.**

Ex. 1 Quiz scores (out of 10) for a class of 16 students. Calculate the stand. dev.
5 7 9 6 5 10 8 2 11 8 7 7 6 9 5 8
First, we must decide if we are using the pop’n or the sample formula. Because it’s the whole class (not a sample), we will use the pop’n formula.
For the formula for $\sigma$, we need $\mu$, $x - \mu$, $(x - \mu)^2$, and N. N is 16 (total number of data). We will make a table.

Data       x − µ       (x − µ)^2
2
5,5,5
6,6
7,7,7
8,8,8
9,9
10
11

VARIANCE:
Variance - the mean of the squares of the deviations (the square of the standard deviation!)
*(Variance is the ACTUAL mathematical measure of spread, vs. standard deviation, but it is more difficult to understand because it is in SQUARE units; quality control is an example of an application of variance.)

$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$ (Pop’n)      $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$ (Sample)

Ex. 2 Find the variance for the data in Ex. 1.
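One way to check the table work for Ex. 1 and Ex. 2 is to let a short Python script build the same three columns. This is a sketch that applies the population formulas to the 16 quiz scores listed in Ex. 1; the variable names are arbitrary.

```python
from math import sqrt

# Quiz scores from Ex. 1 (the whole class, so the population formulas apply).
scores = [5, 7, 9, 6, 5, 10, 8, 2, 11, 8, 7, 7, 6, 9, 5, 8]

N = len(scores)
mu = sum(scores) / N   # population mean

# One table row per datum: (x, x - mu, (x - mu)^2), in ascending order of x.
rows = [(x, x - mu, (x - mu) ** 2) for x in sorted(scores)]

variance = sum(sq for _, _, sq in rows) / N   # sigma squared
std_dev = sqrt(variance)                      # sigma

print("   x    x - mu  (x - mu)^2")
for x, dev, sq in rows:
    print(f"{x:>4} {dev:>9.4f} {sq:>11.4f}")
print(f"mean = {mu}, variance = {variance:.4f}, standard deviation = {std_dev:.4f}")
```

If the class were treated as a sample instead, the only change would be dividing by `N - 1` rather than `N`, matching the sample formulas above.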
Z-SCORE (not a measure of spread)
A z-score is the number of standard deviations that a datum is from the mean.

$z = \dfrac{x - \mu}{\sigma}$ (Pop’n)      $z = \dfrac{x - \bar{x}}{s}$ (Sample)

*The z-score is found by dividing the deviation of the datum by the standard deviation.

Ex. 3 Determine the z-scores of the marks 5/10 and 10/10 from Ex. 1.

MEASURE OF SPREAD FOR THE MEDIAN:
Data that has been listed in ascending/descending order can be divided into QUARTILES (four equal groups of data, separated by key values) or PERCENTILES (100 equal groups or intervals, separated by key values for each “percentile”). For data divided into quartiles, the INTERQUARTILE RANGE and SEMI-INTERQUARTILE RANGE show how the data cluster around the median.

QUARTILES:
Quartiles divide a set of ordered data into four equal groups. The three “dividing” points are Q1 (1st quartile), Q2 (2nd quartile or median), and Q3 (3rd quartile).
*Q1 is the median of the lower half of the data, and Q3 is the median of the upper half.

The INTERQUARTILE RANGE is Q3 - Q1 (the range of the data in the middle half of the set). The larger the interquartile range, the larger the SPREAD of the central half of the data.
The SEMI-INTERQUARTILE RANGE is half the interquartile range (measuring the spread in the middle of the interquartile range). This measure is not as useful as the interquartile range.

Ex. 4 Determine the median, Q1, Q3, and the ranges for the data in Ex. 1.
First, the data must be placed in order: 2 5 5 5 6 6 7 7 7 8 8 8 9 9 10 11

Data is sometimes said to be “within a quartile”, which really means within one of the quarters separated by the three quartiles.
Ex. 5 The score of 2 is in the 1st quartile (or lower quartile).

PERCENTILES:
Percentiles divide the data into 100 equal intervals. The percentiles are labelled P1, P2, P3, ... , P99. In general, we refer to the nth percentile as Pn. We say that “n” percent of the data is less than or equal to Pn, and (100 - n) percent is greater than Pn.

Ex. 6 Use the data below (final grades for a class) to answer the questions. (60 pieces of data)
35 38 41 44 45 45 47 50 51 53
63 56 57 58 58 59 60 62 62 62
62 63 63 64 64 65 65 66 67 67
67 68 68 69 69 70 72 72 73 74
75 75 76 78 79 81 82 82 83 84
86 86 87 88 90 91 92 94 96 98
a) What is the 70th percentile for this data?
b) What is the 25th percentile for this data?
c) What percentile corresponds to a grade of 81%?
d) What percentile corresponds to a grade of 82%?

HMK - p. 148 #1, 2b, 3 (3c wrong in text), 4, 5 (sample, not pop’n, though vague in question), 6abd (sample, not pop’n as it should be), 7ab, 10, 11, 14 (a-ii & c wrong in text)

REVIEW HMK - p. 151 #1, 2 (for 2a, use larger # of intervals b/c the range is so large), 3ac, 4a-d, 5, 6, 7, 9, 10, 11, 12abc, 14, 15, 16ac, 18a (a-ii wrong in text), 19, 20ab (use sample, not pop’n like text), p.154 #1, 2, 3ac, 4, 5, 7, 8, 9

CHAPTER 2 - FORMULAS

$\mu = \dfrac{\sum x}{N}$      $\bar{x} = \dfrac{\sum x}{n}$

$\bar{x}_w = \dfrac{\sum w_n x_n}{\sum w_n}$

$\mu \approx \dfrac{\sum f_n m_n}{\sum f_n}$      $\bar{x} \approx \dfrac{\sum f_n m_n}{\sum f_n}$

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$      $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

$\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$      $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

$z = \dfrac{x - \mu}{\sigma}$      $z = \dfrac{x - \bar{x}}{s}$
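As a rough check on the z-score and quartile formulas from Section 2.6, the sketch below applies them to the 16 quiz scores from Ex. 1. It assumes the population formulas (the whole class) and the "median of each half" rule for Q1 and Q3 stated in the notes; the function and variable names are made up for illustration.

```python
from math import sqrt
from statistics import median

# Quiz scores from Ex. 1 of Section 2.6, already placed in ascending order.
scores = [2, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9, 10, 11]

N = len(scores)
mu = sum(scores) / N
sigma = sqrt(sum((x - mu) ** 2 for x in scores) / N)   # population formula

def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Quartiles as the medians of the lower and upper halves of the ordered data.
# Assumption: for an odd count, the middle value is left out of both halves.
half = N // 2
q1 = median(scores[:half])
q2 = median(scores)
q3 = median(scores[-half:])

iqr = q3 - q1        # interquartile range
semi_iqr = iqr / 2   # semi-interquartile range

print(f"z(5) = {z_score(5):.2f}, z(10) = {z_score(10):.2f}")
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}, semi-IQR = {semi_iqr}")
```

A negative z-score means the mark is below the mean, and a positive one means it is above. Other software may use a different quartile convention, so its Q1 and Q3 can differ slightly from the values found by hand.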