YEAR 12 STATISTICS

S 7-1 Carry out investigations of phenomena, using the statistical inquiry cycle:
A – conducting surveys that require random sampling techniques and using existing data sets
B – evaluating the choice of measures for variables and the sampling and data collection methods used
C – using relevant contextual knowledge, exploratory data analysis, and statistical inference

Uses the statistical inquiry cycle to conduct surveys and to analyse existing data sets.
Conducts surveys to find solutions to problems (or uses existing data sets):
o Poses survey questions, considering sources of variation, for example, what are the variables to be collected, how each variable will be measured.
o Designs, trials, and improves questionnaires using a range of appropriate question types, checking the survey questions using, for example, desk review, conducting pilot surveys.
o Selects and uses appropriate sampling methods, for example, simple random, systematic, stratified, cluster, and quota.
o Evaluates sampling method used, for example, is a sample sufficiently large, randomly chosen, and representative of the population.
o Collects and manages data.
o Uses exploratory data analysis to explore features of the data:
  Uses appropriate statistical plots and tables to explore the data and communicates relevant detail and overall distributions.
  Uses appropriate measures to communicate features of the data.
  Uses relevant contextual knowledge when communicating findings.
Makes statistical inferences.
Communicates findings in a report which includes:
o relevant summary statistics, graphs and tables to support the findings of the survey
o quantitative and qualitative statements
o statistical inferences
o justified conclusions.

S 7-2 Make inferences from surveys and experiments:
B – using sample statistics to make point estimates of population parameters
C – recognising the effect of sample size on the variability of an estimate

B. Using sample statistics to make point estimates of population parameters.
Understands that the sample statistics can be used as point estimates of the population parameters, for example, sample medians and IQRs can be used as point estimates for population medians and IQRs, or sample proportions for population proportions when using categorical data.

C. Recognising the effect of sample size on the variability of an estimate:
Within the context of an investigation and statistical plots of observed data:
o Finds informal confidence intervals for population medians.
o Plots sample data showing informal confidence intervals (median ± 1.5 × IQR / √n) on boxplots.
o Uses an informal confidence interval to make an inference about the population median from sample data plot.
o Makes a claim about whether one group has larger values than another group using informal confidence intervals for the population medians.
o Explains the connections among sample, population, sampling variability, sample size effect, informal confidence interval, and degree of confidence.

S 7-3 Evaluate statistically based reports:
B – identifying sampling and possible non-sampling errors in surveys, including polls

B. Identifying sampling and possible non-sampling errors in surveys, including polls:
In a media report on a survey or a poll, identifies sampling error and explains the connection among sample, population, sampling variability, and sampling error.
In a media report on a survey or a poll, identifies and evaluates, using critical questions (look under the heading critical questions in the work doc below), sampling methods and possible non-sampling errors such as self-selection, non-response bias, behavioural considerations.

Exploratory data analysis notes
Exploratory data analysis starts with multivariate data.
Investigative questions that can be asked of the data should be posed, such as:
o wondering whether there is a connection between two variables,
o wondering whether other variables should be taken into account when possible patterns are observed,
o exploring multiple representations of the data in order to unlock the stories in the sample data.
Technology such as a graphics calculator can draw a modified box plot, which shows whether extreme data values are outliers. Outliers are not simply the greatest or least data values. Outliers are more than 1.5 times the interquartile range above the upper quartile or below the lower quartile. If the sample box plot is approximately symmetrical and has no outliers it can be assumed the population has a similar distribution.
If the sample data is skewed, then the median will be more reliable than the mean as an estimate of the population central value. However, if the distribution of the sample data is skewed this does not imply that the population is skewed. The skewness may be an artefact of sampling variability.
A statistical estimate is not a guess but an inference or prediction of the true population parameter based on sample statistics. The sample median is used to infer (used as a point estimate of) the population median. Similarly the sample mean, quartiles, and standard deviation can be used as estimates of the corresponding population parameters. A sample proportion can be used to estimate a population proportion, for example, the fraction or percentage of students who travel more than 30 minutes to and from school each day.
Evaluation of sampling and data collection methods must be based on identifying features of good sample design or good experimental design.
Appropriate considerations are those that would make the inference more reliable/less variable, such as:
o further (described) strata,
o repeated sampling and averaging statistics,
o context factors,
o relative size of the mean and standard deviation, i.e. if the standard deviation is small in relation to the mean, then the population is likely to be closely spread about the population mean.
o If the sample contains at least 30 items, it may be trivial at Level 7 to suggest a larger sample would improve the inference of a measurement.

Measure
An amount or quantity that is determined by measurement or calculation.
The term ‘measure’ is used in two different ways in the curriculum. One use is in the terms measure of centre, measure of spread, and measure of proportion, where these measures are calculated quantities that represent characteristics of a distribution. The use of ‘using displays and measures’ in the level 6 (statistical investigation thread) achievement objective is a reference to measures of centre, spread, and proportion.
The other use applies to a statistical investigation. The investigator decides on a subject of interest and then decides the aspects of it that can be observed. These aspects are the ‘measures’.
Example
o An investigator decides that ‘well-being’ is a subject of interest and chooses ‘happiness’ to be one aspect of well-being. Happiness could be measured by the variable ‘the average number of times a person laughs in a day’.

Non-sampling error
One of the two reasons for the difference between an estimate (from a sample) and the true value of a population parameter; the other reason being the error caused because data are collected from a sample rather than the whole population (sampling error).
Non-sampling errors have the potential to cause bias in surveys or samples.
There are many types of non-sampling errors, and the names used for them are not consistent.
Some examples of non-sampling errors are:
o The sampling process is such that a specific group is excluded or under-represented in the sample, deliberately or inadvertently. If the excluded or under-represented group is different, with respect to survey issues, then bias will occur.
o The sampling process allows individuals to select themselves. Individuals with strong opinions or those with substantial knowledge will tend to be over-represented, creating bias.
o Bias will occur if people who refuse to answer have different views of the survey issues from those who respond. This can also happen with people who are never contacted and people who have yet to make up their minds.
o If the response rate (the proportion of the sample that takes part in a survey) is low, bias can occur because respondents may tend consistently to have views that are more extreme than those of the population in general.
o The wording of questions, the order in which they are asked, and the number and type of options offered can influence survey results.
o Answers given by respondents do not always reflect their true beliefs because they may feel under social pressure not to give an unpopular or socially undesirable answer.
o Answers given by respondents may be influenced by the desire to impress an interviewer.

Sampling error
The error caused because data are collected from part of a population rather than the whole population.
An estimate of a population parameter, such as a sample median or sample proportion, is different for different samples (of the same size) taken from the population. Sampling error is one of two reasons for the difference between an estimate and the true, but unknown, value of the population parameter. The other reason is non-sampling error. The error for a given sample is unknown, but when sampling is random, the size of the sampling error can be estimated by calculating the margin of error.

Sampling variation
The variation in a sample statistic from sample to sample.
Suppose a sample is taken and a sample statistic, such as a sample median, is found. If a second sample of the same size is taken from the same population, it is almost certain that the sample median found from this sample will be different from that found from the first sample. If further sample medians are found, by repeatedly taking samples of the same size from the same population, then the differences in these sample medians illustrate sampling variation.

Sample size
The number of objects, individuals, or values in a sample.
Typically, a larger sample size leads to an increase in the precision of a statistic as an estimate of a population parameter. The most common symbol for sample size is n.

Sampling Notes
Reasons for sampling include time and cost considerations, lack of access to the entire population, and the nature of the data collection or test, for example, a blood test does not require all blood to be taken, and testing the breaking strain of fishing line destroys the line.
Features of a good sampling technique are that the sample is sufficiently large, randomly chosen, and representative of the population.
Sample size affects the variability of an inference. If a sample is too small, it is more likely to be unusual and less likely to be representative. As the Central Limit Theorem for sample means (a level 8 objective) applies to samples of at least 30 items, random samples of this size are acceptable. There is no statistical requirement that a sample be a proportion of the population. For an inference of a population proportion, however, a much larger sample size is needed, at least 250. This size comes from margin of error considerations (a level 8 objective) but at level 7 an intuitive understanding is sufficient.
Randomised sampling techniques include simple random, systematic, stratified, cluster, and quota.
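As a rough illustration of how three of the techniques listed above could be carried out, the sketch below draws a simple random, a systematic, and a proportionally stratified sample. It is only a sketch under stated assumptions: the school roll of 600 ID numbers, the year-level strata, and the function names are all hypothetical and are not part of these notes. Cluster and quota sampling are left out to keep the example short.

```python
import random

def simple_random_sample(population, n, rng):
    """Every item has the same chance of selection; no item is picked twice."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Random start, then every k-th item from an ordered list (needs n <= len)."""
    k = len(population) // n           # sampling interval
    start = rng.randrange(k)           # random start inside the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n, rng):
    """Sample each stratum in proportion to its share of the population."""
    total = sum(len(items) for items in strata.values())
    sample = []
    for items in strata.values():
        share = round(n * len(items) / total)   # rounding can shift the total by one
        sample.extend(rng.sample(items, share))
    return sample

rng = random.Random(1)                 # fixed seed so the sketch is repeatable
roll = list(range(1, 601))             # hypothetical school roll of 600 ID numbers
strata = {                             # hypothetical year-level groups
    "Year 11": roll[:240],
    "Year 12": roll[240:420],
    "Year 13": roll[420:],
}

srs = simple_random_sample(roll, 30, rng)
sys_sample = systematic_sample(roll, 30, rng)
strat = stratified_sample(strata, 30, rng)

print("simple random:", sorted(srs)[:5], "...")
print("systematic:   ", sys_sample[:5], "...")
print("stratified:   ", len(strat), "students in total")
```

Keeping the random number generator seeded is one way students could provide evidence of how a sample was drawn, since the same seed reproduces the same sample.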
It is important to identify the positive features of each method and be able to carry out each method correctly in order for the sample to be as representative as possible. Students must be able to provide evidence they have carried out their chosen sampling methods correctly. If a sample is randomly chosen then it is likely to be representative of the population, although any single random sample can still be unusual.

Sources of variation
The reasons for differences seen in the values of a variable. Some of these reasons are summarised in the following paragraphs.
Variation is present everywhere and is in everything. When the same variable is measured for different individuals, there will be differences in the measurements, simply due to the fact that individuals are different. This can be thought of as individual-to-individual variation and is often described as natural or real variation.
Repeated measurements on the same individual may vary because of changes in the variable being measured. For example, an individual’s blood pressure is not exactly the same throughout the day. This can be thought of as occasion-to-occasion variation.
Repeated measurements on the same individual may vary because of some unreliability in the measurement device, such as a slightly different placement of a ruler when measuring. This is often described as measurement variation.
The difference in measurements of the same quantity for different individuals, apart from natural variation, could be due to the effect of one or more other factors. For example, the difference in growth of two tomato plants from the same packet of seeds planted in two different places could be due to differences in the growing conditions at those places, such as soil fertility or exposure to sun or wind. Even if the two seeds were planted in the same garden, there could be differences in the growth of the plants due to differences in soil conditions within the garden. This is often described as induced variation.
Variation occurs in all sampling situations.
Suppose a sample is taken and a sample statistic, such as a sample median, is found. If a second sample of the same size is taken from the same population, it is almost certain that the sample median found from this sample will be different from that found from the first sample. If further sample medians are found, by repeatedly taking samples of the same size from the same population, then the differences in these sample medians illustrate sampling variation.

Statistical inference
The process of drawing conclusions about population parameters based on a sample taken from the population.
Example 1
o Using a sample median calculated from a random sample taken from a population to estimate the population median is an example of statistical inference.
Example 2
o Using data from a random sample taken from a population to obtain a 95% confidence interval for the population proportion is an example of statistical inference.

Activities
big ideas
Rationale
different samples of the same size give different pictures of proportions showing that small samples for categorical data are unreliable because of the large variation between samples
Ideas of proportion are needed in year 13
Working with categorical data – sampling variability and sample size
Sampling errors - Variability of sample proportion
Using populations of counters to draw samples of size 20 and observe the variability in outcomes. (some notes in introduction to making an inference about a population)
to tag or not to tag http://www.censusatschool.org.nz/classroomactivities/to-tag-or-not-to-tag/
margin of error (stats teachers day 2006) Marina and Anne http://www.stat.auckland.ac.nz/~teachers/index6.php
Sampling errors - Increasing sample size improves the reliability of a point estimate
What size sample do we need to take to get a reliable proportion?
Categorical data needs big sample sizes to be able to make a reliable estimate for the population proportions
Links to probability objectives from level 6, starting to make the connections
It’s a new starting point for thinking about inference
Understanding of polls and their reliability as predictors is an essential tool for citizenship
The media article selected should be a good hook for student engagement
Statistical literacy link – sampling and non-sampling errors
Working with measurement/numerical data – sampling variability and sample size
Kiwi Kapers 1 http://seniorsecondary.tki.org.nz/Mathematics-andstatistics/Achievement-objectives/AO-S7-1
Introducing sampling variability through repeated sampling from the same population using sample sizes 15 and 30. Looking for patterns across repeated samples of the same size. Five samples of size 30 for each pair/group of students (keep a record of these 5 sample medians for Kiwi Kapers 3). 5 samples of 15 and 5 samples of 30 needed for Kiwi Kapers 2.
different samples of the same size give different pictures of the distribution of weights of kiwi
repeated samples of size 15 show more variation than repeated samples of size 30
at this stage using the multiple samples to get a possible interval for the population median (early ideas that will be built on later about getting an interval estimate of the population median from a single sample)
building on level 5 and 6 understandings of the PPDAC cycle and sampling variability, introducing the new data set that will be used throughout following lessons as statistical concepts are built
Activities
big ideas
Rationale
Kiwi Kapers 2 http://seniorsecondary.tki.org.nz/Mathematics-andstatistics/Achievement-objectives/AO-S7-1
Exploring different sample sizes and mutually agreeing to a sensible size based on smaller variation of sample medians combined with practical considerations such as the efficient use of time and resources.
(Catching kiwis to collect the data)
deciding on the smallest sample size that will give us a reliable estimate of the population median
Students need to have the opportunity to see how different sample sizes produce different variation patterns for the sample median. It is important that they engage in the debate as to what makes a good sample size.
This activity is built around using Fathom to generate multiple samples, but this could easily be done using the data viewer http://www.censusatschool.org.nz/2010/data-viewer/ on CensusAtSchool to collect sample medians for samples of size 50 and 100. (Sample medians for sample size 15 and 30 can be used from the Kiwi Kapers 1 activity.)
o Under PLAN select the Kiwi Kapers 2008 population and select total sample size of 50 and then 100 (ignore the “I am a year 12 student” at this stage).
o Select get my sample (this provides a sample which shows in DATA).
o Under ANALYSIS, select variable 1 as weight and then tick “Add summaries”. Now click “Do Analysis”. Record median.
o Click back in the sample size box and hit enter – generates a new sample.
o Repeat ANALYSIS, select variable 1 as weight and then tick “Add summaries”. Now click “Do Analysis”. Record median.
o Continue until X sample medians have been found.
o Collate class results.

Making inferences in summary situations
Activities
big ideas
Rationale
Kiwi Kapers 3 (see draft ideas in word document, kiwikapers3. NOTE kiwikaper3v1 has the n=30 included, but would expect students to use their own data as in kiwikapers3; v1 is to give the sense of what is happening)
Note: make a better student sheet. Thinking A4 landscape, put the popn and n=400 and then make big IQR picture, leave space for n=30, decent scale.
developing the n idea for the informal confidence interval for the population median
1.5 × IQR – activity to be written up, building to using 1.5 × IQR/√n
Note: take fathom output and put into word doc. Make 1, 1.5 and 2 separate.
teaching notes to support this need to really spell out how you collected the measures. EIS-T, this is to be done.
Some activities to use the whole cycle and practise making informal interval estimates for the median. Note: This would include describing the interval in words.
Using the entire PPDAC cycle and making an inference about the population parameter in a summary situation
Money spent at canteen (see draft activity attached PPDAC L7 evaluating process - summary inference)
Do NZ teenagers get enough sleep?
Starting to evaluate the PPDAC cycle with a particular focus on the sampling method.

Sampling
Sampling stuff 1 http://seniorsecondary.tki.org.nz/Mathematics-andstatistics/Achievement-objectives/AO-S7-1
Sampling stuff 2 http://seniorsecondary.tki.org.nz/Mathematics-andstatistics/Achievement-objectives/AO-S7-1
Check re: SURFs.

Making inferences in comparison situations
Activities
big ideas
Rationale
Comparing boys' and girls' sleep (see draft activity attached PPDAC L7 evaluating process - comparative inference activity)
Note: could come back to kiwi kapers, they can use C@S and use the data viewer, tick year 12 student. Important to get the interval onto the graph, leading into year 13.
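
A possible worked example for the comparison activity: the sketch below computes the informal confidence interval (median ± 1.5 × IQR / √n) for two groups and checks whether the intervals overlap. The sleep figures are made up purely for illustration and are much smaller than a real Level 7 sample would be; the function names are hypothetical. Quartiles come from Python's statistics.quantiles, and since quartile conventions differ between tools, the end points may differ slightly from those given by Fathom or a graphics calculator.

```python
from math import sqrt
from statistics import median, quantiles

def informal_interval(data):
    """Return (lower, upper) for median ± 1.5 × IQR / √n."""
    q1, _, q3 = quantiles(data, n=4)          # lower quartile, median, upper quartile
    half_width = 1.5 * (q3 - q1) / sqrt(len(data))
    m = median(data)
    return m - half_width, m + half_width

def intervals_overlap(a, b):
    """True when two (lower, upper) intervals share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

# Made-up hours of sleep per night, for illustration only.
boys_sleep = [6.0, 6.5, 7.0, 7.0, 7.5, 7.5, 7.5, 8.0,
              8.0, 8.0, 8.0, 8.5, 8.5, 9.0, 9.0, 9.5]
girls_sleep = [6.5, 7.0, 7.5, 7.5, 8.0, 8.0, 8.0, 8.5,
               8.5, 8.5, 9.0, 9.0, 9.0, 9.5, 9.5, 10.0]

boys_ci = informal_interval(boys_sleep)
girls_ci = informal_interval(girls_sleep)

print(f"boys:  {boys_ci[0]:.2f} to {boys_ci[1]:.2f} hours")
print(f"girls: {girls_ci[0]:.2f} to {girls_ci[1]:.2f} hours")

if intervals_overlap(boys_ci, girls_ci):
    print("Intervals overlap: no claim that one group tends to sleep longer.")
else:
    print("Intervals do not overlap: a claim about the population medians is supported.")
```

The same two end points are what students would mark on each boxplot to get the interval onto the graph, and describing the interval in words (for example, "it is a fairly safe bet that the population median is somewhere between ... and ...") completes the inference.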