Mayfield High School – Higher Tier

Mayfield High is a fictitious school, but the data was collected from real students. You will investigate the relationship between height and weight. There are mistakes in the data, including impossible heights and weights.

Locating the Mayfield Database
In the school Intranet select Subject and Group Sites.
Then select Maths Website, followed by Mayfield, and click on the word HERE.

You will spend some time on the computers obtaining data to work on and producing graphs and statistical calculations, but marks are awarded for your entire write-up of the coursework. Your report must contain the following:
1. A pilot study (including a commentary)
2. A detailed plan for your main study
3. Calculations and graphs produced
4. Detailed conclusions
5. An evaluation

Firstly, as you are only interested in height, weight, year and gender, you must delete the rest of the data.

Deleting Unwanted Data (leaving Year, Gender, Height and Weight)
Select columns 2 to 7.
From the Edit menu select Delete.
Now repeat this for columns 3 to 12 and 5 to 11.

REMEMBER TO REGULARLY SAVE YOUR WORK INTO YOUR MY WORK FOLDER

Part 1 – The Pilot Study

A pilot study (sometimes called a pretest) is a preliminary test to see if there is a line of enquiry to investigate further. You will be testing the following hypothesis in your pilot study:

The taller you are the heavier you are

Taking a Sample
There are 1183 students at Mayfield High School. The table shows the number of boys and girls in each year group.

Year Group   Year 7   Year 8   Year 9   Year 10   Year 11
Boys            151      145      118       106        84
Girls           131      125      143        94        86
Total           282      270      261       200       170

For your pilot study you must consider each of the 10 groups in the school (i.e. Year 7 Girls, Year 7 Boys, Year 8 Girls etc.) to test your initial hypothesis. To do this you will need samples from each group, as the whole school is too large to work with.
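The figures in the table above are enough to work out how many students a sample of 100 would take from each group if every group is represented in proportion to its size. The Python sketch below is only an illustration: the names group_sizes and stratified_counts are made up for this example, and rounding each share to the nearest whole student is an assumption (with other figures the rounded counts might not add up to exactly the sample size).

```python
# Number of students in each group, copied from the table above.
group_sizes = {
    ("Year 7", "Boys"): 151, ("Year 7", "Girls"): 131,
    ("Year 8", "Boys"): 145, ("Year 8", "Girls"): 125,
    ("Year 9", "Boys"): 118, ("Year 9", "Girls"): 143,
    ("Year 10", "Boys"): 106, ("Year 10", "Girls"): 94,
    ("Year 11", "Boys"): 84, ("Year 11", "Girls"): 86,
}


def stratified_counts(sizes, sample_size):
    """Share the sample out in proportion to each group's size.

    Rounding to the nearest whole student is an assumption of this sketch.
    """
    total = sum(sizes.values())
    return {group: round(n * sample_size / total) for group, n in sizes.items()}


counts = stratified_counts(group_sizes, 100)
for (year, gender), n in counts.items():
    print(f"{year} {gender}: {n}")
print("Total in sample:", sum(counts.values()))
```

For these figures the rounded counts add up to exactly 100, so no adjustment is needed.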
You should take a stratified random sample of 100 students for your pilot study.

Stratified Sample - A stratified sample takes a proportional number from each group in the population so that each group is fairly represented. This is necessary when producing graphs or statistical calculations on more than one section of the population together.

E.g. If there were 150 Year 7 girls in a school with 1000 students and you wanted to take a stratified sample of 100 students in total, then use the formula:

$$\text{No. of each group in sample} = \frac{\text{no. of students in group}}{\text{total no. of students in school}} \times \text{sample size}$$

E.g.

$$\text{No. of Year 7 girls} = \frac{150}{1000} \times 100 = 15 \text{ students}$$

Random Sample - Every person should have an equal chance of being chosen for your sample to make it fair and avoid bias. A quick way of doing this is to give each student a random number and then sort the data on this number, which produces a random list of the data.

Getting a random sample:
1. Type =RAND() in the first free cell to the right of the first line of data and press Enter to insert a random number.
2. Click on this cell again and move the cursor to the bottom right of the cell until it changes to a black cross. Drag down until you reach the bottom of the data.
3. To mix up the data, highlight the cell to the right of the first random number. Select the Data menu and Sort. Sort by column 5 and this will mix up all the data.
4. Now select the Data menu and Sort, and Sort by Year Group, Then by Gender.
5. Select the calculated number of students from each group and copy them to a separate sheet.

Calculations and evidence required for your pilot study:
- One scatter graph showing the whole sample for the pilot study
- Correlation coefficient for all heights and weights in your sample

Scatter Graphs – These are used to show the relationship (correlation) between two types of data.

Correlation Coefficient – This is a more accurate way to measure correlation.
It uses the mean of each set of data and looks at the distance of each point from the mean. The formula, which is known as the Product Moment Correlation Coefficient or r, is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

(where $\bar{x}$ and $\bar{y}$ are the means of the x and y values respectively)

The value of r determines correlation. It is always between –1 and 1.

-1 = Perfect Negative Correlation
-0.8 = Good Negative Correlation
-0.5 = Some Negative Correlation
0 = No Correlation
0.5 = Some Positive Correlation
0.8 = Good Positive Correlation
1 = Perfect Positive Correlation

A line of best fit should only be drawn on a scatter graph if the correlation coefficient is >0.6 or <-0.6.

The reason for displaying the equation of the line of best fit is that it can be used to make predictions.
E.g. If the equation for Year 8 boys is y = 50x – 40, this means that the weight of a boy in Year 8 can be predicted from his height by calculating weight = 50 × height – 40.

r² (called R² in Excel) is the square of the correlation coefficient and tells you how far you can rely on predictions from a line of best fit. It is the proportion of the variation in y that is accounted for by the line of best fit (i.e. how much of the variation in weight is explained by height).
E.g. If the correlation coefficient r = 0.8 then r² = 0.64 = 64%.
That is, 64% of the variation in weight is explained by the straight-line relationship with height.

To Draw a Scatter Graph in Excel:
1. Highlight the two columns of data
2. Click on Chart Wizard (bar chart icon on tool bar)
3. Choose XY (Scatter)
4. Enter chart title and label axes (remember units!)
5. In Legend untick box labelled Show Legend
6. Choose whether to save as separate chart or on sheet

To Improve Presentation:
Right click on the x-axis and select Format Axis, choose Scale and change the minimum value. You can repeat this for the y-axis if necessary.
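The product moment correlation coefficient described above can also be worked through outside Excel as a check. The Python sketch below is an illustration only: the function name pearson_r and the five heights and weights are invented for the example and are not taken from the Mayfield database.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of each point's distances from the two means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of the squared distances from each mean
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented example data (not from the Mayfield database)
heights = [1.50, 1.55, 1.60, 1.65, 1.70]  # metres
weights = [40, 45, 47, 52, 60]            # kilograms

r = pearson_r(heights, weights)
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```

The same two columns entered in Excel should give a matching value from the CORREL function.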
To Put on a Line of Best Fit (only if strong enough correlation):
Right click on a point in the scatter graph and select Add Trendline. In Options tick the boxes to display the equation and R-squared value on the graph.

To Calculate Correlation Coefficient in Excel:
1. Select a blank cell in the spreadsheet
2. Click on fx on the tool bar
3. Select Statistical in the function category
4. Select Correl in the function name and then click ok
5. In Array 1 highlight all the heights
6. Click in Array 2 and highlight all the weights
7. Click ok

Part 2 – The Main Study

Based on the results of your pilot study, you should have a line of enquiry to investigate further and should refine your hypotheses to include how you think age and gender will affect results.

E.g.
Hypothesis 1 – The taller you are the heavier you are. The relationship between the two will become stronger as you get older.
Hypothesis 2 – Boys are taller and heavier than girls. The difference between boys and girls will increase as the students get older.

Sampling
For your main study you will deal with each of the 10 groups in the school separately to make comparisons across year groups and gender. To do this you will need larger samples from each group. 30 students from each group should be enough to perform statistical calculations on. As your data has already been randomised you simply need to return to this and take the first 30 from each group.

Anomalies
Once you have taken your sample, you may notice anomalies (extreme values in your data, like someone who is 4.65m tall or 5kg in weight). These anomalies may be outliers (i.e. results differing greatly from others in the same sample). You should test for these outliers and then decide how to deal with them (leave them in or replace them). Keep your graphs from before and after.

There are two methods to test for outliers:

1. Using Standard Deviation
A piece of data is considered an outlier if it is more than two standard deviations away from the mean of the data set.
e.g.
The mean of a set of heights is 1.54m and the standard deviation is 0.11m. Is the height 1.25m in the data set an outlier?
2 × standard deviation = 2 × 0.11 = 0.22
1.54 – 0.22 = 1.32
1.25 is smaller than 1.32 and so is more than 2 standard deviations from the mean, making it an outlier.

2. Using Interquartile Range (IQR)
A piece of data is considered an outlier if it is more than 1.5 times the interquartile range above the upper quartile (UQ) or below the lower quartile (LQ).
e.g. The lower quartile for a set of data is 6 and the upper quartile is 9.5. Is the value 15 in the data set an outlier?
IQR = UQ – LQ = 9.5 – 6 = 3.5
Multiply the IQR by 1.5: 3.5 × 1.5 = 5.25
Any values more than 5.25 below the lower quartile or more than 5.25 above the upper quartile are outliers.
UQ + 5.25 = 9.5 + 5.25 = 14.75
15 is larger than 14.75 and so is an outlier.

Replacing Anomalies
If you find an outlier in your data you must deal with it and include evidence. If it is only just an outlier in your test you may choose to leave it in, but you must explain why you have chosen to do this. If it is a clear outlier you must remove it and replace it with a new person from your original randomised data. Your scatter graph and calculations will automatically update, so remember to print out any graphs and calculations before deleting, to use as evidence in your report.
Do not forget that if you replace a piece of data, the new value will also need testing (N.B. the standard deviation or interquartile range used will be a different value from before).

Calculations required for the Main Study
1. Scatter graphs for each group
2. Correlation coefficient for each group
3. Line of best fit, equation and R² on scatter graphs (only if the correlation coefficient is >0.6 or <-0.6)
4. Mean and standard deviation for heights (or weights) for each group
5. Minimum, lower quartile, median, upper quartile and maximum for weights (or heights).
From this you can find the range and interquartile range and produce box and whisker diagrams (these should be done by hand on graph paper – putting them all on one sheet along the same scale will make comparisons easier)

Standard Deviation – Standard deviation (represented by the symbol σ) looks at how spread out the data is. It is obtained by looking at how far each individual value is from the mean. It is calculated using the formula:

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

where $\bar{x}$ is the mean of the data set and n is the number of values.

The larger the value obtained, the further the values are from the mean. Standard deviation is a way of comparing the spread of two sets of data. Standard deviation can be calculated in different ways depending upon the type of data. In Excel the function STDEVP uses the above formula, which is the one you need for your data.

To Calculate Standard Deviation in Excel:
1. Select an empty cell
2. Click on fx on the tool bar
3. Select Statistical in the function category
4. Select Stdevp in the function name and then click ok
5. Highlight the first item in the list and drag down to highlight all the data in the column, which will appear in the number 1 box (ignore number 2 box) and click ok

To Calculate the Mean in Excel:
1. Click in a blank cell
2. Click on fx on the tool bar
3. Select Statistical in the function category
4. Select Average in the function name and then click ok
5. Highlight the list of numbers you require the mean for, which will appear in the number 1 box (ignore number 2 box) and click ok

Interquartile Range – This is also a measure of spread but looks at the spread of the middle 50% of the data around the median. It is found by subtracting the lower quartile from the upper quartile (calculating UQ – LQ).

To Calculate Quartiles in Excel:
The Lower Quartile
1. Click in an empty cell
2. Click on fx on the tool bar
3. Select Statistical in the function category
4. Select Quartile in the function name and then click ok
5.
Highlight the column of data, which will appear in the array box
6. Click in quart box and type 1
7. Click ok

To calculate the Minimum, repeat as above but type 0 instead of 1 in the quart box
To calculate the Median, repeat as above but type 2 in the quart box
To calculate the Upper Quartile, repeat as above but type 3 in the quart box
To calculate the Maximum, repeat as above but type 4 in the quart box

Drawing Box and Whisker Diagrams
1. Use graph paper
2. Draw a horizontal scale. It is a good idea to draw all the box plots on one piece of graph paper using the same scale. You will then be able to compare your results in your conclusion.
3. Mark on the 5 pieces of data you have found and draw in the box and whiskers as shown below.

[Figure: box and whisker diagram drawn above a horizontal scale from 1.0 to 2.0, labelled Minimum, Lower Quartile, Median, Upper Quartile and Maximum]

Summary Tables
Summary tables are used to make comparisons between years and gender easier. You should include key pieces of data in your summary table(s):
- Correlation Coefficient
- Mean and Standard Deviation
- Median and Interquartile Range

E.g.

Group               Correlation    Height                        Weight
                    Coefficient    Mean    Standard Deviation    Median    IQR
Year 7    Male
          Female
Year 8    Male
          Female
Year 9    Male
          Female
Year 10   Male
          Female
Year 11   Male
          Female

Bar charts can also be a way to compare these values visually. E.g.

[Figure: bar chart titled "Mean Heights", showing Mean Height (m) from 0 to 2 against Year Group (Year 7 to Year 11), with separate bars for Girls and Boys]

Conclusions
Your conclusions are very important and must be detailed but clear, without waffle! Any calculations which you do not mention are considered irrelevant data. Make statements and comparisons from your calculations across year groups and gender and link these to your original hypotheses. (More info in the Your Report section of this booklet)

Evaluation
You must consider how reliable your findings are. Can you use them to make correct predictions for the whole school? Could they be used to make predictions for your school?
One way of considering the limitations of your sample is to compare the key results in your summary table to calculations for the whole population (use Excel to obtain these calculations).

Your Report
Your write-up must include the following:

Pilot Study
- Brief introduction to the coursework task
- Why and how the pilot study was carried out
- Why and how a stratified random sample was taken and any possible limitations
- Findings from the pilot study

Plan for Main Study
- State refined hypotheses (to include variation across years and gender)
- How you will prove/disprove your hypotheses (i.e. what calculations and analysis you will make; explain which calculations will prove which hypothesis and WHY you use them/what they show)
- Why/how the sample is taken; how many; what groupings
- How you will eliminate bias/deal with outliers (include evidence)

Represent
- Scatter graphs for each group, correlation coefficient, line of best fit and equation (where appropriate)
- Min, max, lower and upper quartiles and median for heights or weights of each group
- Box and whisker diagrams (on same sheet for later analysis)
- Mean and standard deviation for heights or weights
- Summary table(s) of calculated values

Conclusions
- Correlation shown on each scatter graph and what it shows
- Comparison of correlation values across the groups
- Explain your equations of lines of best fit and use them to make predictions/inferences (include calculations)
- Comparison of height/weight between boys/girls
- Comparison of height/weight across year groups
- Refer to box and whisker diagrams and what they demonstrate
- Refer to mean and standard deviation and what they demonstrate
- Relate findings to original hypotheses. Do your findings support them?
- Give/suggest reasons for your findings
- Give/suggest reasons for exceptions to patterns in your data

Evaluation
- Was your sample a good/fair/representative sample?
- Any limitations of the project?
- Any problems you faced and changes you would make
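When you include evidence of how you dealt with outliers, the two tests from the Anomalies section can be checked with a short Python sketch. This is an illustration only: the function names are made up for the example, and the numbers are the ones used in the worked examples in this booklet.

```python
def is_outlier_sd(value, mean, sd):
    """True if value is more than 2 standard deviations from the mean."""
    return abs(value - mean) > 2 * sd


def is_outlier_iqr(value, lq, uq):
    """True if value is more than 1.5 x IQR below the LQ or above the UQ."""
    step = 1.5 * (uq - lq)
    return value < lq - step or value > uq + step


# Worked example 1: mean 1.54, standard deviation 0.11, height 1.25
print(is_outlier_sd(1.25, 1.54, 0.11))

# Worked example 2: LQ 6, UQ 9.5, value 15
print(is_outlier_iqr(15, 6, 9.5))
```

Both calls print True, which agrees with the worked examples. Remember that after replacing a value the mean, standard deviation and quartiles must be recalculated before testing again.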