Shoreham Academy A-Level Statistics Statistics 1B Dean Hubbard AQA Dean Hubbard © Contents Formulae Booklet Formulae to Learn Numerical Measures Probability Binomial Distribution Normal Distribution Estimation Correlation and Regression Each topic will include Exam Tips How to use a Calculator/Graphical Calculator Worked Examples Formulae Booklet On this page you will find Formulae so you can refer to them This also includes Cumulative Binomial Distribution Tables Normal Distribution Tables Percentage Points of the Normal Distribution From Page 10 and onwards the booklet is used for Statistics Whilst up to page 8 there are useful formulae for Core Maths These are the formulae which you don’t need to learn but can refer to and are needed in the Stats 1B Exam. The Binomial Distribution Tables are found on Page 15 – Page 20. This includes up to “n=50” The Normal Distribution Tables and Percentage Point of the Normal Distribution are located on pages 24 and 25 respectively. Formulae to Learn These are useful formulae that you may want to learn as they could end up being the exam. Some of them will need to be learned, but if not you can always use your calculator to work the answer out. The calculator method will be used in the topic later on. Samples ∑π₯ π₯Μ = Mean Μ π₯ = Variance s² = s² = raw data π ∑ ππ₯ ∑π ∑(π₯−π₯Μ )2 grouped or tabled data (where x is the midpoint) raw data π−1 1 {∑ ππ₯ ² π−1 − (∑ ππ₯)² } ∑π grouped or tabled data (where x is the midpoint) s = √π£πππππππ Standard Deviation Populations Mean μ = μ= Variance σ² = ∑π₯ ∑ ππ₯ ∑π grouped data (where x is the midpoint) ∑(π₯−π₯Μ )2 raw data π 1 σ² = π {∑ ππ₯ ² − Standard Deviation raw data π (∑ ππ₯)² } ∑π grouped data (where x is the midpoint) σ = √π£πππππππ (Refer to page 12 of this booklet for the meaning for these probabilities) P (not A) = P (π΄′ ) = 1 − π(π΄) (complementary events) P (A or B) = P (A β π΅) = P (A) + P (B) – P (A β π΅) if not mutually exclusive P (A or B) = P (A β π΅) = P(A) + P(B) If P (A or B) = 1 - if mutually exclusive the events A and B are exhaustive P (A and B) = P (A β π΅) = P(A) x P(B) if independent events If events are independent: P(A|B) = P(A|π΅′ ) = P(A) P(B|A) = P(B|π΄′ ) = P(B) Scientific Calculator Statistics Mode [MODE] [2] (STAT) [1](1-VAR) Graphical Calculator To get to Statistics option [MENU] [2] (STAT) 1 Variable Data 1 Variable Data Enter your Data E.g. 25, 34, 13, 25, 34, 13 Enter your Data E.g. 25, 34, 13, 25, 34, 13 Work out the mean Μ ) [=] [AC] [SHIFT] [1] (STAT) [5] (Var) [2] (π = 24 To work out all the information needed [F2] (CALC) [F6] (SET) [DOWN] [F1] (1) [EXIT] Option should say 1Var Freq : 1 (this makes sure the calculator knows it is dealing with one variable) Population Standard Deviation [AC] [SHIFT] [1] (STAT) [5] (Var) [3] (πππ) [=] = 8.602325267 Sample Standard Deviation [AC] [SHIFT] [1] (STAT) [5] (Var) [4] (xππ − π) [=] =9.42447519 Number in Sample Group [AC] [SHIFT] [1] (STAT) [5] (Var) [1] (π) [=] =6 [F1] (1VAR) (If you were to enter the e.g data you will get this as a list) π₯Μ = 24 ∑ π₯ = 144 ∑ π₯ 2 = 3900 ππ₯ = 8.60232526 π π₯ = 9.42337519 π=6 To work out ∑ π [AC] [SHIFT] [1] (STAT) [4] (Sum) [2] (∑ π) [=] = 144 To work out ∑ ππ [AC] [SHIFT] [1] (STAT) [4] (Sum) [1] (∑ ππ )[=] =3900 2 Variable Data [MODE] [2] (STAT) [2] (A+BX) 2 Variable Data [MENU] [2] (STAT) Enter your data in X and Y E.g. X = 25, 34, 13, 25, 34, 13 Y = 17, 3, 4, 25, 30, 32 Enter your data in List 1 and List 2 E.g. X = 25, 34, 13, 25, 34, 13 Y = 17, 3, 4, 25, 30, 32 Regression [AC] [SHIFT] [1] (STAT) [7] (Reg) (This is for y= a + bx, if you were looking for y = ax + b you’re a + b values would flip) [AC] [SHIFT] [1] (STAT) [7] (Reg) [1] (A) [=] = 19.7972973 [AC] [SHIFT] [1] (STAT) [7] (Reg) [2] (B) [=] = -0.05405405405 [AC] [SHIFT] [1] (STAT) [7] (Reg) [3] (r) [=] = -0.04003237259 Regression [F2] (CALC) [F6] (SET) – (Change Settings to these) 1Var XList: List1 1Var Freq: 1 2Var XList : List1 2Var YList: List2 2Var Freq : 1 [F2] (CALC) [F3] (REG) [F1] (X) – For regression Line ax + b [F1] (aX+b) For regression Line a+bx [F2] (a+bx) Using the example data and if y = ax + b Your calculator should display this; a = -0.54054 b = 19.7972972 r = -0.0400323 π 2 = 1.6025πΈ − 03 Mse = 202.050675 Using example data and if y= a +bx Your calculator should display this; a = 19.7972972 b = -0.54054 r = -0.0400323 π 2 = 1.6025πΈ − 03 Mse = 202.050675 2 Variable Variance The Sample Number [AC] [SHIFT] [1] (STAT) [5] (Var) [1] (n) [=] =6 The Mean of X Variable Μ ) [=] [AC] [SHIFT] [1] (STAT) [5] (Var) [2] (π = 24 Population Standard Deviation X Variable [AC] [SHIFT] [1] (STAT) [5] (Var) [3] (πππ) [=] = 8.602325267 Sample Standard Deviation X Variable [AC] [SHIFT] [1] (STAT) [5] (Var) [4] (xππ − π) [=] = 9.42447519 The Mean of Y Variable 2 Variable Variance [F2] (CALC) [F2] Using the example data, the calculator shall display this 2 - Variable π₯Μ = 24 ∑ π₯ = 144 ∑ π₯ 2 = 3900 ππ₯ = 8.60232526 π π₯ = 9.42337519 π=6 π¦Μ = 18.5 ∑ π¦ = 111 ∑ π¦ 2 = 2863 ππ¦ = 11.6153633 π π¦ = 12.723993 ∑ π₯π¦ = 2640 Μ ) [=] [AC] [SHIFT] [1] (STAT) [5] (Var) [5] (π = 18.5 Population Standard Deviation Y Variable [AC] [SHIFT] [1] (STAT) [5] (Var) [6] (πππ) [=] = 11.61536339 Sample Standard Deviation Y Variable [AC] [SHIFT] [1] (STAT) [5] (Var) [7] (yππ − π) [=] = 12.72399308 To Work out ∑ ππ [AC] [SHIFT] [1] (STAT) [4] (Sum) [1] (∑ ππ ) [=] = 3900 To Work out ∑ π [AC] [SHIFT] [1] (STAT) [4] (Sum) [2] (∑ π) [=] = 144 To Work out ∑ ππ [AC] [SHIFT] [1] (STAT) [4] (Sum) [3] (∑ ππ ) [=] = 2863 To Work out ∑ π [AC] [SHIFT] [1] (STAT) [4] (Sum) [4] (∑ π) [=] = 111 To Work out ∑ ππ [AC] [SHIFT] [1] (STAT) [4] (Sum) [5] (∑ ππ) [=] Binomial Distribution Probabilities of type P(X = x): F5 (DIST) F5 (BINM) F1 (Bpd) eg B( 15, 0.2) P( X = 3) Data F2 (Variable) x 3 EXE Numtrial 15 EXE p 0.2 EXE Execute F1 (calc) p(x) = 0.25013 Probabilities of type P(X ≤ x): F5 (DIST) F5 (BINM) F2 (Bcd) eg B( 15, 0.2) P( X ≤ 3) Data F2 (Variable) x 3 EXE Numtrial 15 EXE p 0.2 EXE Execute F1 (calc) p(x) = 0.64816 This function can also be used for probabilities of the type P(X > x). For example, P(X > 3) = 1 P( X ≤ 3) = 1 – 0.64816 = 0.35184 Normal Distribution: Calculation of probabilities Probabilities of types P(X ≤ x), P(X < x), P(X > x) and P(X ≥ x) can be found directly from : F5 (DIST) F1 (NORM) F2 (Ncd) eg X~ Normal mean 135 st deviation 15 P(X ≤ 127) or P(X < 127) Lower -EXP 99 EXE (lower end needs to be an extremely small value) Upper 127 EXE σ 15 EXE μ 135 EXE Execute F1 (CALC) prob = 0.2969 eg X~ Normal mean 135 st deviation 15 P(X ≥ 118) or P(X >118) Lower 118 EXE Upper EXP 99 EXE (upper end needs to be an extremely large value) σ 15 EXE μ 135 EXE Execute F1 (CALC) prob = 0.87146 Problems involving inverse normal probabilities : F5 DIST F1 NORM F3 InvN eg X~ Normal mean 135 st deviation 15 Find value of x such that P(X< x) = 0.15 ( or P(X ≤ x) = 0.15 ) Area ( to the left) 0.15 EXE σ 15 EXE μ 135 EXE Execute F1 (CALC) x = 119.45 eg X~ Normal mean 135 st deviation 15 Find value of x such that P(X > x) = 0.30 ( or P(X ≥ x) = 0.30 ) Area ( to the left) 0.70 EXE ( area to left is 1 – 0.30) σ 15 EXE μ 135 EXE Execute F1 (CALC) x = 142.86 Estimation – Confidence intervals for mean (known variance or large sample) In both situations, F4 (INTR) F1 (Z ){ since Normal distribution}) F1 (1-S) eg A random sample of 12 packets of crisps with nominal weight 35g is obtained. The weights of such packets can be modelled by a normal distribution with standard deviation 3.5g. The weights of each of the packets in the sample are: 34.6 35.9 37.4 36.2 37.5 34.9 35.8 38.2 37.8 37.2 35.1 35.9 Calculate a 95% confidence interval for the mean weight. Enter data into MENU STAT List 1 F4 (INTR) F1 (Z) F1 (1-S) F1 (List) cursor down to C- Level 0.95 (as 95%) EXE σ 3.5 EXE List List 1 (F1) cursor down Freq 1 (F1) EXE Execute F1 (CALC) Left = 34.394 Right = 38.355 so the interval is { 34.394 , 38.355 } Also given πΜ = 36.375 and n = 12 information on s ( σn-1 ) should be ignored. By using these values you can sub them into the formula for Product Moment Correlation Coefficient (As seen below) Numerical Measures Mean : - Sum of products ÷ Number of products It is important as it uses all the data values If the data is grouped – use the mid-point of each group as your x Disadvantage – the mean is affected by extreme values Median: - It is the middle value when the data is in order Not affected by extreme value Mode: - This is the most common value in the sample It is not suitable if the data values are very varied Range: - The biggest value – smallest value It is affected greatly by extreme values Interquartile Range: - Upper Quartile – Lower Quartile Measures the spread of middle 50% of the data is not affected by extreme values Lower Quartile: - 1π 4 (n = the sample size) – rounded to the next nearest number Upper Quartile: - 3π 4 (n = the sample size) – rounded to the next nearest number Standard Deviation: - Is the spread of data either side of the mean Scaling Data ο· ο· Addition if you add a to each number in the list of data: New Mean = Old Mean + a New Median = Old Median + a New Mode = Old Mode + a Standard Deviation is UNCHANGED Multiplication if you multiply each number in the list of data by b New Values = Old values x b (including old standard deviation x b) EXAMPLE Find the median and quartiles of the following data: 2, 5, 3, 11, 6, 8, 3, 8, 1, 6, 2, 23, 9, 11, 18, 19, 22, 7 First put the list in order: 1, 2, 2, 3, 5, 6, 6, 7, 8, 8, 9, 11, 11, 18, 19, 22, 23 π You need to find π1 , π2 and π3 , so work 4 = 1) π 4 18 π , 4 2 = 18 3π , πππ 4 2 = 54 4 is not a whole number (= 4.5), so round up and take the 5th term: π1 = 3 2) 3) π 7+8 is a whole number (= 9), so find the average of the 9th and 10th terms: π2 = 2 2 3π th is not a whole number (= 13.5), so round up and take the 14 term: π3 = 11 4 = 7.5 a) i) = Modal Value is 23, because as you can see there are 34 days that 23 letters are collected ii) = Median is 22, because you divide your sample size by 2. 175 ÷2 = 84.5. Since this is not a whole number you round up to 85. Therefore you take the 85th value which lies in 22 daily number of letters. = LQ is 20, due to the fact you divide your sample size by 4. 175 ÷ 4 = 43.75, you then round it up giving you a value of 44. Therefore you take the 44th value which lies in the 20 daily number of letters. = UQ is 23, the sample size divided by 4 and multiplied by 3. 175 ÷ 4 x 3 = 131.25, (you can also work this out by 3(175) 4 if you were to use the formula). Since you can’t get a value of 131.25 days, you must round up to the next number which is 132. Then you take the 132nd value which is the 23 daily number of letters. = IQR = 3 , as you use the formula of UQ – LQ (23 – 20) b) = Mean = 22.3 = Standard Deviation = 6.37 or 3.39 To get the full 4 marks on the question, you have to get these specific values. To do this the group data you must take the mid-point and also remember to change your max value to 54. Your mid-points should be as listed; Daily Numbers of letters 0-9 10-19 20 21 22 23 24 25-29 30-34 35-39 40-49 50 or more Total( ∑) Mid-points (x) 4.5 14.5 20 21 22 23 24 27 32 37 44.5 54 Use this formula Μ π₯ = ∑ ππ₯ ∑π Number of days (f) 5 16 23 27 31 34 16 10 5 3 4 1 175 π(π₯ 2 ) 101.25 3364 9200 11907 15004 17986 9216 7290 5120 4107 7921 2916 94132.25 (fx) 22.5 232 460 567 682 782 384 270 160 111 178 54 3902.5 to work out the mean of grouped data (You can also work it out via the calculator, by inputting the data which is highlighted. Then following the steps on pages 5 and 6, but unfortunately a scientific calculator cannot deal with a 2 Variable Frequency so you have to use the Sum Option) So 3902.5 175 = 22.3 1 For Standard deviation use this formula s² = π−1 {∑ ππ₯ ² − 1 into the formula so, s = √π−1 {∑ ππ₯ ² − (∑ ππ₯)² } ∑π 1 (∑ ππ₯)² } ∑π , inputting the data , √174 {94132.25 − 3902.52 } 175 = 6.39 It does not matter which formula you use, you can use either population or sample data – the mark scheme allows both answers. You can get the answers straight from the graphical calculator if you change a few settings to allow for 2 Variable Frequency, to do this you must [F2] (CALC) [F6] (SET) – (Change Settings to these) 1Var XList: List 1 1Var Freq: 1 2Var XList : List 1 2Var YList: List 2 2Var Freq: List 2 Then press [F2] (2Var) and it shall display this list as shown π₯Μ = 22.3 ∑ π₯ = 3902.5 ∑ π₯ 2 = 94132.25 ππ₯ = 6.37248549 π π₯ = 6.390771 π = 175 π¦Μ = 23.02171428 ∑ π¦ = 4063 ∑ π¦ 2 = 110479 ππ¦ = 9.60587575 π π¦ = 9.63343929 ∑ π₯π¦ = 88186.5 As you can see you can see all information is correct, if you were to have the settings as; 1Var XList: List 1 1Var Freq: 1 2Var XList : List 1 2Var YList: List 2 2Var Freq : 1 This would be incorrect since you are taking the 2Variable Frequency from List 1 EXAM TIP: If you were working it out using a graphical calculator and you were unsure whether it was in the correct settings. Just use the scientific calculator to work it out since there are no settings needed to be changed. c) Mean = 23.9 Since the sum of x is now 280. Use the formula, 280 ÷ 175 = 1.6 Now you just add the new mean to the old mean, 22.3 + 1.6 = 23.9 Probability Mutually Exclusive Events two or more events that cannot happen at the same time e.g. passing and failing and exam P (A πΌ B) = P(A) + P(B) Independent Events One event has no effect on another event occurring e.g. throwing a 1 or a 2 on a fair dice P (A ∩ B) = P(A) x P(B) Exhaustive Events Any events where all possible outcomes are included e.g. throwing a head or a tail on a fair coin Conditional Probability When the outcome of the first even affects the outcome of a second event depends on what has happened Notation EXAMPLE A and B are two events, with P(A) = 0.4, P(B | A) = 0.25, and P(A’ ∩ B) = 0.2 a) Find i) P(A ∩ B), ii) P(A’), iii) P(B’ | A), iv) P(B | A’), v) P(B), vi) P(A | B) b) Say whether or not A and B are independent a) i) π(π΅ | π΄) = π (π΄ ∩π΅) π(π΄) = 0.25, so by rearranging the formula you get P(A ∩ B) = P(B | A) x P(A) = 0.25 x 0.4 = 0.1 ii) P (A’) = (Use formula above in notation) = 1 – 0.4 = 0.6 iii) P (B’ | A) = 1 – P (B | A) = 1 – 0.25 = 0.75 - (Since P(B’ | A) + P(B | A) = 1, so we rearrange it) iv) π(π΅ | π΄′) = π(π΅ ∩ π΄′ ) π(π΄′ ) 0.2 π = 0.6 = π (Since P (B ∩ A’) = P (A ∩ B’) , need to learn the formula) 1 3 v) P (B) = P (B | A) P (A) + P(B | A’)P(A’) = (0.25 x 0.4) + ( x 0.6) = 0.3 vi) P (A | B) = π( π΄ ∩ π΅) 0..1 = 0.3 π(π΅) 1 =3 b) If P (B | A) = P (B), then A and B are independent But P(B | A) = 0.25, while P(B) = 0.3, so A and B are not independent a) i) P(B = 3) = 194 640 = 0.303 Since it says exactly 3 bedrooms you must put ‘=’. You follow the 3 bedrooms on the table to the total which gives you 194 then divide it by the total of everything which is 640. 563 ii) P(T ≥ 2) = 640 = 0.879, At least 2 bathrooms, this includes 2 and everything above to hence the ‘≥’ sign. You can get this answer by various ways of adding or subtracting, the way I did it was to subtract 77 from 640 (640 – 77 = 563) then divide by the total. The number 77 was picked due to the fact the answer is more than 2 and the only number below 2 which is in the table is 1. If you look down the first column the total number of houses with 1 toilet is 77. iii) P(B = 3 & T ≥ 2) = 194−7 = 640 0.292, Again look at what the question asks you and use the correct notation. On this question you can do it various ways too, the way I did it was to just look at the 3 number of bedrooms row. The total which is 194,( since we worked it out in part i) but then it asked AND at least 2 toilets, we look at every value above 2 toilets in the 3 bedroom row. We do not need the value of 1 toilet in a 3 bedroom house, therefore we do 194 – 7 divided by the total. iv) P(B ≤ 3 | T = 2) = 172−19 172 = 0.889, The denominator is different in this question, due to what it asks you. “Given that it has exactly 2 toilets”, therefore we must look down the 2 toilets column and the total is 172 (which is your denominator) Now we must look at the ‘at most 3 bedrooms’, the property must not have any more than 3 bedrooms, and there are 19 properties with 4 bedrooms which we must not include. Therefore we must do 172 – 19 = 153 then divide it by the total number of properties that have exactly 2 toilets. The answers from the previous 4 questions can be left as decimals, percentages or fractions. But you will be penalised if you have the answers mixed, this means if you answer the first part as a decimal, then the rest of the answers should also be decimals. b) 0.303 x 0.879 ≠0.292 , therefore they’re not independent P(A and B) = P(A β π΅) = P(A) x P(B) This is the equation which is needed to get the marks for this question, the equation that is shown states to be independent then the probability of events A and B happening = the probability of A x the probability of b. The question asks you to use relevant answers from part a), the P(A) is P(B = 3), P(B) is P(T ≥ 2) and P(A β π΅) is P(B = 3 & T ≥ 2). Now we have established each part of the equation we can sub it in, 0.303 x 0.879 (= 0.266) ≠ 0.292 therefore they’re not independent. c) P(2π ∩ 3π ∩ ≥ 4π | π΅ = 3) = 72 194 x 99 193 x 16 192 =0.01586453715 x 6 = 0.0951872229 = 0.0952 The questions states, exactly 3 bedrooms, with one property having 2 toilets, another 3 toilets and the other at least 4 toilets. The total will be 194 as that is how many properties have 3 bedrooms, but the total will lower by one each time. This is because you’re choosing 3 properties in a row; you’d have to take into consideration the ones you’ve picked. As you can see in the method you do 72 99 16 this, 194 x 193 x 192 . Now you have to multiply your answer by 6, this may seem confusing at first but it is best described via a probability tree. Without one it’s hard to show but you multiply by 6, as there is 6 different ways of choosing the 3 properties randomly. Binomial Distribution The Binomial Distribution, B(n,p) A random variable x follows a Binomial Distribution as long as 5 conditions are satisfied. 1) There is a fixed number (n) of trials 2) Each trial involves either “success” or “failure” 3) All trials are independent 4) The probability of “success” (p) is the same in each trial 5) The variable is the total number of success in the n trials π Formula: P(X = x) = ( ) x π π₯ x (1 − π)π−π₯ π₯ Mean of Binomial Distribution If X ~ B (n,p) then : Mean (or expected value) = π = πΈ(π₯) = ππ (It’s a theoretical mean – the mean of experimental results is unlikely to match it exactly) Variance of Binomial Distribution If X ~ B(n,p) then: Variance= Var(x) = π 2 = ππ(1 − π) = πππ Standard Deviation = π = √ππ(1 − π) = √πππ EXAMPLE A Student has to take a 50-question multiple-choice exam, where each question has five possible answers, of which only one is correct. He believes he can pass the exam by guessing answers at random. a) How many questions could the student, be expected to guess correctly? b) If the pass mark is 15, what is the probability that the student will pass the exam? c) If the examiner decides to set the pass mark, so that it is at least 3 standard deviations above the expected number of correct guesses. What should the minimum pass mark be? Let X be the number of correct guesses over the 50 questions. Then X ~ B (50, 0.2) a) E(X) = np = 50 x 0.2 = 10 b) P(X ≥15) = 1 – P(X < 15) = 1 – P(X ≤ 14) = 1 – 0.9393 = 0.0607 c) Var(X) = np(1 - p) = 50 x 0.2 x 0.8 = 8 – so the standard deviation = √8 = 2.828 So the pass mark needs to be at least 10 + (3x2.828) ≈ 18.5 – i.e. the minimum pass mark should be 19 6) a) i) = 0.149 U ~ B (30, 0.13) P (P=2) = (30 )(0.13)2 (0.87)28 2 Unfortunately for the first question you can’t use the binomial tables in the formula booklet. Due to your p value being 0.13, which is not included on the tables. You have to use the binomial distribution formula, (ππ₯)π π₯ (1 − π)π−π₯ , sub your values into the equation, (30 )(0.13)2 (0.87)28. Use your calculator to work the answer out, which is 0.14890697 then 2 round to 3.d.p Or if you have a graphical calculator you can work it out via that, [MENU] [2] (STAT) [F5] (DIST) [F5] (BINM) [F1] (Bpd) Change the screen to this; Data : Variable π₯βΆ2 Numtrial: 30 P : 0.13 Save Res: None Execute [Calc] Binomial P.D P = 0.14890697 P = 0.149 ii) p = 0.35 P(R ∪ P > 10) = 1 – 0.5078 = 0.4922 To get your p value of 0.35, you have to add the Value of red which is 0.22 to purple which is 0.13. 0.22 + 0.13 = 0.35. You can look on page 18 of the formula booklet, for the n = 30 table. Look along the top for the p value of 0.35 then down to where x = 10. The value should be 0.5078, but since the tables show when P( π ≤ π₯ ) and you want more than. You need to do 1 – 0.5078 to get the correct answer of 0.4922. Or by using a graphical calculator, [MENU] [2] (STAT) [F5] (DIST) [F5] (BINM) [F2] (Bcd) Change screen to this; Data : Variable π₯ βΆ 10 Numtrial: 30 P : 0.35 Save Res: None Execute [Calc] Binomial C.D P = 0.5075821 P = 0.5076 The graphical calculator is more accurate than the binomial distribution tables, but when you do 1 – 0.5076, the answer falls in the range on the mark scheme. iii) P(5 ≤ G ≤ 10) = 0.9744 – 0.2552 = 0.719 For this question, we want the values of P(G = 5) + P(G = 6) + P(G = 7) + P(G = 8) + P(G = 9) + P(G = 10) to get the correct answer but there is a trick for this. We need the values up to 10, so out notation will be P(G ≤ 10) , in the question it asks at least 5. This eliminates the values of 4 or less, since we now know this we can do P(G ≤ 10) – P(G ≤ 4). P(G ≤ 10) = 0.9744 , P(G ≤ 4) = 0.2552. 0.9744 – 02552 = 0.719 will give you full marks. If you did the sum of them as mentioned in the beginning, you would only get 1 mark for the correct answer. b)i) Mean = 100 x 0.22 = 22 Variance = 100 x 0.22 x 0.78 = 17.16 In this question, the number of ‘n’ has changed from 30 to 100. This means to get the mean we do ‘ππ’, which is 100 x 0.22 to get 22. Then to get the new variance we sub our new numbers into the formula of’ πππ’ which is 100 x 0.22 x 0.78 to get 17.16 ii) 22.1 = 22, so reject claim (that p > 0.22) √17.16 = 4.14 ≈ 4.17 Reject claim that not random samples The claim suggests that the proportion of red paper clips in the bin is (p > 0.22), but the analysis is 22.1. They are similar to each other, to 1.d.p. But we reject the claim as p is not greater than 0.22 To find out if the Standard deviation we need to Square Root our variance from part b, √17.16 = 4.14. The two standard deviations do not match, therefore we reject the claim that they’re not random samples, since the claim does not match with the analysis. EXAM TIP: Numerical Justification is easy marks; just take your time to answer it correctly. You’re just comparing to items of data, make sure you use the data provided in the question to your answer. Normal Distribution Normal Distribution N (π , π 2 ), π−π π =π§ If X is normally distributed with mean (π) and variance (π 2 ), it is written as X ~ N (π , π 2 ) The standard normal distribution ‘Z’ has mean 0 and variance 1 .e.g.. Z ~ N (0,1) EXAMPLE X ~ N ( 4, 0.52 ) a) Find P (X ≤ 4.5) =P(Z≤ 4.5−4 ) 0.5 = P ( Z ≤ 1.0 ) = 0.84134 The normal distribution is ‘Bell-Shaped’, most things in real life are most likely to fall ‘somewhere in the middle’. They are less likely to take the ‘extremely high’ or ‘extremely low’ values. This kind of situation is called a normal distribution. There is a peak in the middle at the mean, the graph is symmetrical. This means that the values are the same distance above and below the mean are equally likely. The most important normal distribution is the standard normal distribution, or Z – this has a mean of zero and a variance of 1. Use tables to work out probabilities of Z You look up a value of z and these tables labelled ΙΈ(π§)) tells you the probability Z ≤ z > or ≥ (1 – ‘N’) < or ≤ ‘N’ If Z score is negative, you have to flip the inequality EXAMPLE (Use the tables on page 24 of the formula booklet) P (Z < 0.1) = 0.53983 P (Z > 0.23) = 1 – P (Z ≤ 0.23) = 1 – 0.59095 = 0.40905 P (Z ≥ -0.42) = P (Z ≤ 0.42) = 0.66276 P (0.12 < Z ≤ 0.82) = P(Z ≤ 0.82) – P(Z ≤ 0.12) = 0.79389 – 0.54776 = 0.24613 P (Z = 1) = 0 All probabilities in the table of ΙΈ(z) are greater than 0.5, but can still use the tales with values less than this. Just use this formula (1 – probability) ο· Finding Z values – using the table of percentage points The Table of percentage points is located on page 25 of the formula book You use the percentage table in a similar way as the normal distribution table, by starting with a probability and looking up a value for Z. EXAMPLE X ~ N (10, 0.5 2 ) Find z such that P(X < z ) = 0.96 – Look up 0.96 ΙΈ (z) = 0.96 , z = 1.7507 Finding the mean and variance To find one of these, you will need to form an equation as follows: Y ~ N(π, 4²) P (Z ≤ 23 − π 4 23 − π 4 P (Y ≤ 23) = 0.75 find π. ) = 0.75 = 0.6745 π = 20.302 ο· To find both of them, you will need to form a pair of simultaneous equations: X ~ N( π, π²) P ( X ≤ 1.83) = 0.3 P(Z≤ 1.83 − π ) π P ( X ≤ 2.31) = 0.7 P(Z≤ = 0.3 2.31 − π ) π = 0.7 This will be a negative ‘z’ value This will be a positive value 1.83 − π π 2.31 − π π = -0.5244 1.83 - π = -0.5244π (1) Eliminate either π or π = 0.5244 2.31 - π = 0.5244π (2) π = 2.07m π = 0.458 5) a) i) Weight, W ~ N(2.75, 0.152) P(W < 2.8) = P( π < 2.8−2.75 ) 0.15 P(Z < 0.33) = 0.629 The formula what we need to use is, π−π π = π§. Now collect the terms that we need, which are the ‘mean’(π)= 2.75, ‘Standard deviation’ (π) = 0.15 and our ‘X’ is 2.8. Since our weight is less than 2.8kg, we write the notation P(W < 2.8). This is then converted to π ( π < 2.8−2.75 ) as 0.15 we want to know the normal distribution of it, as you can see the terms we have gathered have been subbed into the correct places. 2.8−2.75 0.15 = 0.33, so now we do P(Z < 0.33) for this we can look in the tables on page 24 of the formula booklet. We go down to 0.3 and along to 0.03 in the tables which gives us the answer of 0.62930, then round to 3dp. By using a graphical calculator [MENU] [2] (STAT) [F5] (DIST) [F1] (NORM) [F2] (Ncd) A screen with a set of options should appear change to these settings Normal C.D Data: Variable Lower: -100 Upper: 2.8 π βΆ 0.15 π: 2.75 Save Res: None Execute [F1] (CALC) This gives us the P value of 0.63055837, round down up 0.631 This falls in the range of the answer in the mark scheme ii) P ( W > 2.5) = P( π > 2.5−2.75 ) 0.15 P(Z > -1.67) = P(Z < 1..67) = 0.952 The rules pretty much apply as they did in the last question. The only difference is that this time we’re dealing with a ‘more than sign’, nothing changes until after you sub your numbers in. You should get a value of -1.67, now since it’s a negative number we flip the sign to less than and get rid of the negative. Now we just treat it normally, look in the Normal Distribution tables on page 24 , to get your answer as 0.952 By using a graphical calculator [MENU] [2] (STAT) [F5] (DIST) [F1] (NORM) [F2] (Ncd) A screen with a set of options should appear change to these settings Normal C.D Data: Variable Lower: 2.5 Upper: 1E + 99 π βΆ 0.15 π: 2.75 Save Res: None Execute [F1] (CALC) This gives us the P value of 0.95220964, round down to 0.952 This falls in the range of the answer in the mark scheme b) Weight, X ~ N(5.25, 0.202) i) P(5.1 < X < 5.3) = P(Z < 0.25) – P(Z < -0.75) 0.59871 – ( 1 – 0.77337) = 0.372 For this we do the normal distribution of 5.3 – the normal distribution of 5.1, we’re combining our two methods worked out in the part a of this question. But we’re dealing with new numbers, again we use the same formula and sub the numbers in . For P(X < 5.3) P( π < P(X > 5.1), ) P( π < 5.1 −5.25 )= 0.2 5.3 −5.25 ) 0.2 = P(Z < 0.25) . For P(Z < -0.75) remember as it’s a negative we flip the sign to make it P(Z > 0.75). We still need to do another step, as Normal distribution tables on work when (X ≤ x), as it’s a more than sign remember we now do (1 – P(X > x)). When we look in the tables for the values we get 0.59871 – (1-0.77337) which is 0.372. By using a graphical calculator [MENU] [2] (STAT) [F5] (DIST) [F1] (NORM) [F2] (Ncd) A screen with a set of options should appear change to these settings Normal C.D Data: Variable Lower: 5.1 Upper: 5.3 π βΆ 0.2 π: 0.525 Save Res: None Execute [F1] (CALC) This gives us the P value of 0.37207897, round down to 0.372 This falls in the range of the answer in the mark scheme ii) P(0 in 4) = [1 – 0.372]4 = 0.6284 = 0.155 We don’t know how many bags of 5kg the store sells, so we do [1 – 0.372]4, as we’re taking away the random sample of 4 from the original population, we have the power of 4 because that’s how many bags we’re dealing with. We have to use the probability from part b)i) since the question ask to calculate the probability of 5.1 and 5.3. c) Weight, Y ~ N(10.75, 0.502 ) Variance of πΜ 6 = 0.52 /6 = 0.0416 P(πΜ 6 < 10.5) = P(π < 10.5−10.75 ) √0.0416 P (Z < - 1.22) = 1 – P(Z < 1.22) 1 – (0.88877) = 0.111 For this question, you need to know a little bit of estimation. To work out the sample mean, we do the standard error which is π √π or π2 π (This will be our new or variance or standard deviation) Variance of πΜ 6 = 0.5 /6 = 0.0416 this is our new variance. We can now treat the question normally for a normal distribution. We sub in the correct values into the formula; we get the negative number again. We have to flip the sign, making it a more than sign. With a more than sign we do 1 – P(X>x) , this means we do 1 – 0.88877 which is 0.111 2 Estimation Estimating the population mean from a sample mean and finding a confidence interval. In a simple random sample - Every person or item in the population has an equal chance of being in the sample - Each selection is independent of every other selection Statistics A statistic is a quantity calculated only from the known observation in a sample. A statistic is a random variable it takes different values for different samples. The probability distribution of a statistic is called a sampling distribution. Statistics are used to estimate population parameters, ‘population parameters’ are the populations characteristics e.g. mean (π) and variance (π) For example, you can use the sample mean π₯Μ = ∑π₯ π to estimate the population mean (π). The sample mean is called an unbiased estimator of the population mean. This means the expected value of the sample mean is the same as the population mean. Formulae for Unbiased Estimators of population variance and standard deviation Sample Variance = π 2 = π ∑ π₯2 [ π−1 π ∑π₯ 2 −( π ) ] Sample Standard Deviation = s = √π πππππ π£πππππππ EXAMPLE A random sample was taken from a population whose mean (π) and variance (π 2 ) are unknown. If the observed values were 5.6, 5.8, 5.5, 5.8 and 5.7, find the unbiased estimates of π and π 2 . The sample mean (π₯Μ ) will give an unbiased estimate of the population mean (π) ∑ π₯ = 28.4, π π π₯Μ = ∑π₯ π = 28.5 5 = 5.68 Unbiased estimate of the population variance (π 2 ) ∑ π₯ 2 = 161.38, π π π 2 = π ∑ π₯2 [ π−1 π ∑π₯ 2 −( π ) ] = π 2 = 5 161.38 28.4 2 − ( ) ] [ 4 5 5 = 0.017 The standard error and Central Limit Theorem A standard error is a measure of the reliability of a statistic π The standard error of πΜ is √π Suppose you’ve got a random variable π, with mean π and variance π 2 When you take a sample of n readings from the distribution of π, you can work out its sample mean πΜ . If you now keep taking samples of size n from that distribution and working out the sample means then you get a collection of sample means, drawn from the sampling distribution of Μ π This sampling distribution also has mean π, but a variance of standard error of the sample mean π2 , π this standard deviation is called the If X is normally distributed (X ~ N(π, π 2 )), then πΜ ~ π(π, π2 ) π The Central Limit Theorem The central limit theorem tells you something about πΜ , - even if you know nothing at all about the distribution of X Suppose you take a sample of n readings from any distribution with mean and variance For Large n, the distribution of the sample mean is πΜ , is approximately normal πΜ ~ π(π, π2 ) π The bigger n is the better the approximation will be (n > 30 it’s pretty good) A Confidence interval is a type of interval estimation A confidence interval for the mean of a population is a range of values that you’re fairly sure contains the true population mean. E.G – “(11.2, 14.4) is a 95% confidence interval for π” This means that there’s a 95% probability that the interval of (11.2, 14.4) includes the true value of π. A confidence interval extends “Z x Standard Error” from the sample mean. Confidence intervals for π – Normal distribution and known variance A confidence interval for the population mean is (πΜ − π§ π , √π πΜ + π§ π ) √π The value of Z depends on the level of confidence you need for example For a 95% confidence interval, choose z so that P(-z < Z < z) = 0.95 For a 98% confidence interval, choose z so that P(-z < Z < z) = 0.98, you get the picture. 7) a) π₯Μ = 18.1 36 = 5.05, 98% (0.98) z = 2.32 Cl for π ππ π₯Μ ± π§ × π √π , 5.05 ± 2.3263 × 0.075 √36 5.05 ± 0.03 (or 5.05, 5.08) We want to find the mean first so we divide the total volume of 181.80 by the total number of measurements which is 36. π₯Μ = 18.1 36 = 5.05, For the confidence limit of 98% we look at page 25 in the formula booklet, you have to take the limit of 99% which is 2.3263 as this will cover the 98% confidence limits and all decimal places within. We now need to multiply by the standard error to measure the reliability. Confidence limit for ππ π₯Μ ± π§ × 0.075 ) √36 π √π , sub in the values into the correct position 5.05 ± (2.3263 × Which gives you a final answer of 5.05 ± 0.03 Or by using a graphical calculator [MENU] [2] (STAT) [F4] (INTR) [F1] (Z) [F1] (1-S) It now should come up with a list of data variables that need to be entered. Your screen should end up like this 1 – Sample ZInterval Data : Variable C – Level : 0.98 π βΆ 0.075 π₯Μ : 5.05 π βΆ 36 Save Res: None Execute [F1] (CALC) A screen should appear with this on 1 – Sample ZInterval Left = 5.02092065 Right = 5.07907935 π₯Μ : 5.05 π βΆ 36 This leads on to our answer of (5.02, 5.08) taking the information gathered from the left and right. You write the answer as coordinate, as this is where the limit stands. b) Claim is π > 5L, Confidence limits exceed 5, so the first claim is correct. 8/36, 8 bottles out of 36 have less than 5 litres, the claim is fewer than 10%, 8/36 ≠10%, therefore we disagree with the second claim. c) Yes because volumes are not stated as normally distributed We only use the Central Limit Theorem if original population is not normally distributed, assuming that the value of n Is greater than 30, in which case it is. Therefore we did make use of the Central Limit Theorem as n > 30, therefore the original population is not normally distributed. Correlation and Regression Correlation Correlation is a measure of how closely variables are linked. Draw a scatter diagram to see pattern in data. Positive correlation has a positive gradient. As one goes up, so does the other variable. Negative correlation has negative gradient. As one group goes up, the other variable decreases. No correlation, scattered points, no line of best fit. As the two variables aren’t linked at all, you’d expect a random scattering. To measure correlation, we use the ‘The Product-Moment Correlation’ The Product-Moment Correlation Coefficient (In Short PMCC, or , ‘r’) Measures how close to a straight line the points on a scatter graph lie. PMCC is always between +1 and -1 r = + 1, if all your points lie exactly on a straight line with a positive correlation. If r is between ± 0.2 and 0 means there is a weak correlation If r is between the values of ±0.2 and ±0.7 there is a moderate correlation If r is between the values of ±0.7 and ±0.9 there is a strong correlation r = -1, if all your points lie exactly on a straight line with a negative correlation. You will never expect to get a PMCC of +1 or -1, your scatter points may lie pretty close but unlikely to be all on it. If you cannot visualise the correlation by the use of values obtained, create a scatter graph to help you describe the correlation. Always remember when describing the correlation refer to your variables e.g. there is some positive correlation between the height and the weight of a person. To work out correlation you would use these formulae, to know how to get the correct parts from your data set. Refer to pages 6 and 7. EXAMPLE Illustrate the following data with a scatter diagram, and find the product-moment correlation coefficient (r) between the variables x and y. If p= 4x – 3 and q = 9y + 17, what is the PMCC between p and q? X Y 1.6 11.4 2.0 2.1 2.1 2.5 2.8 2.9 11.8 11.5 12.2 12.5 12.0 12.9 3.3 13.4 3.4 3.8 4.1 4.4 12.8 13.4 14.2 14.3 16 14 Just plot the points for the scatter graph 12 Now for the correlation coefficient. From the scatter diagram, the points lie pretty close to a straight line with a positive gradient – so if the correlation doesn’t come out pretty close to +1, we’d not need to worry. 10 8 6 4 2 0 0 1 2 3 4 5 There are 12 pairs of readings, n = 12. Now you have to work out a load of sums, if you working it out manually add extra rows to your table. But you can work parts using a calculator. x y π₯2 π¦2 xy 1.6 2.0 2.1 2.1 2.5 2.8 2.9 3.3 3.4 3.8 11.4 11.8 11.5 12.2 12.5 12.0 12.9 13.4 12.8 13.4 2.56 4 4.41 4.41 6.25 7.84 8.41 10.89 11.56 14.44 129.96 139.24 132.25 148.84 156.25 144 166.41 179.56 163.84 179.56 18.24 23.6 24.15 25.62 31.25 33.6 37.41 44.22 43.52 50.92 Note: Answers in blue are the sums of each row, which are needed for the PMCC 4.1 14.2 16.81 201.64 58.22 4.4 14.3 19.36 204.49 62.92 Stick all these in the formula to get π= [453.67 − √[110.94 − = 9.17 √8.857 ×10.56 35 × 152.4 ] 12 352 152.42 × − ] [1946.04 12 12 ] = 0.948 Correlation coefficients aren’t greatly affected by linear scaling – you can multiply either variables by positive numbers (or multiply them both by negative numbers) and add numbers to them, and you won’t change the PMCC between them. So if p and q are given by p = 4x – 3 and q = 9y + 17 , then the PMCC between p and q is also 0.948 35 152.4 110.94 1946.04 453.67 1) a) r = ππ₯π¦ √ππ₯π₯ ×π π¦π¦ = −0.410 = √2.030 ×1.498 -0.235 The question asks you to find the value of the product moment correlation coefficient, all you need to do is look up the equation in the formula booklet. Then sub in your values into the correct places in the formula. Use a calculator to get your answer of -0.235. b) There is a weak correlation between width and thickness of lengths of steel Since the r value is very close to ± 0.2, then the correlation is a weak one. To get the full marks you must refer to the two variables. Linear Regression Linear Regression is in the simplest of words, finding the line of best fit. Decide which the independent variable is and which the dependent is. The independent variable is the variable along the x-axis, it is the variable you can control or the one you think is affecting the other. Whilst the dependent variable, goes up the y-axis, it is the response variable the one you think is being affected. The equation for the regression line is y = a + bx , Where ‘a’ is the intercept, and ‘b’ is the gradient. Before plotting a line of best fit, you must work out at least two coordinates that lie on the line by substituting in value of x Linear Scaling is a linear transformation or scaling of one or both of the variables. It doesn’t affect the ‘r’ value. A Residual is the difference between an observed y-value and the y-value predicted by the regression line. Residual = Observed y-value – Estimated y-value 1) Residuals show the experimental error between the y-values that’s observed and the y-value your regression line says it should be 2) Residuals are shown by a vertical from the actual point to the regression line. 3) You would like your residuals to be small, as this would show your regression line fits the data well. If they’re large then your regression line on your model won’t be a very reliable one Outliers have a big effect on the regression line; they are able to drag the line of best fit away from the rest of the data values. When you indicate an outlier, you should try to explain it, e.g. it could be an error that you are told to correct. EXAMPLE Find the equation of the regression line of y on x for the data below, the data shows the load on a lorry, π₯ (in tonnes), and the fuel efficiency, π¦ (in km per litre) π₯ 5.1 5.6 5.9 6.3 6.8 7.4 7.8 8.5 9.1 9.8 72.3 = ∑ π₯ π¦ π₯2 9.6 26.01 9.5 31.36 8.6 34.81 8 39.69 7.8 46.24 6.8 54.76 6.7 60.84 6 72.25 5.4 82.81 5.4 96.04 π₯π¦ 48.96 53.2 50.74 50.4 52.04 50.32 52.26 51 49.14 52.92 73.8 = ∑ π¦ 544.81 = ∑ π₯2 511.98 = ∑ π₯π¦ Then work out ππ₯π¦ , is given by ππ₯π¦ = ∑(π₯ − π₯Μ )(π¦ − π¦Μ )= ∑ π₯ − 2 2 and ππ₯π₯, is given by: ππ₯π₯ =∑(π₯ − π₯Μ ) = ∑ π₯ − (∑ π₯)(∑ π¦) π (∑ π₯)2 π ππ₯π¦ The gradient (b) of your regression line is given by b = π and the intercept (a) is given by: a = π¦Μ - bπ₯Μ π₯π₯ 1) Work out the sums: ∑ π₯ = 72.3 , ∑ π¦ = 73.8 , ∑ π₯ 2 = 544.81, ∑ π₯π¦ = 511.98 2) Then work out ππ₯π¦ and ππ₯π₯ βΆ ππ₯π¦ = 511.98 − ππ₯π₯ = 544.81 − 72.32 10 72.3 ×73.8 10 = −21.594 = 22.081 3) So the gradient of the regression line is: b = 4) And the intercept is: π = ∑π¦ π −π ∑π₯ π = −21.594 = 22.081 73.8 − 10 -0.978 (−0.978) × 72.3 10 = 14.451 = 14.5 5) This all means that your regression line is: π¦ = 14.5 − 0.978π₯ This tells you: (i) for every extra tonne carried, you’d expect the lorry’s fuel efficiency to fall by 0.978km per litre, and (ii) with no load (x = 0), you’d expect the lorry to do 14.5kkm per litre of fuel. 3) a) b (gradient) = 2.27 a (intercept) = 4.17 To get these answers, use a calculator. Input the data values on your calculator, and then follow the step by step on page 6. Double check that your values are correct, then print your values on the page. ‘A’ = 4.16981131 and ‘B’ = ‘2.27’ b) Correct Straight line (You will get a graph for this question, so you don’t need to draw one out) 200 180 160 140 120 100 80 60 40 20 0 0 20 40 60 80 100 c) i) Correct Straight line It will look different to the line above, as it’s a different regression line. Work out the first and last coordinate. Sub x = 10 into equation, π¦π = 60.1 + 0.255(10) = 62.65 First Coordinate = (10, 62.65) then sub x = 80 into equation, π¦π = 60.1 + 0.255(80) = 80.5. Last coordinate = (80, 80.5) The mark scheme is lenient of your coordinates. ii) 27 to 29 The mark scheme is lenient on this as well, not everyone’s line of best fit on part i) will be the same. But all you do is look at y = 100, then go across to the regression line and read the X coordinate of it. iii) At low temperatures more B (than A) Dissolves or At high temperatures more A dissolves than B. Amount increases more rapidly for A than B or Amount increases more slowly for B than A