1. Drug A is a new drug created to treat Congenital Amegakaryocytic Thrombocytopenia (CAMT). An experiment was done to evaluate the survival outcomes of drug A against the current standard, Rituximab, for treating CAMT. 150 CAMT patients were assigned to the drug A group, and 150 CAMT patients were assigned to the Rituximab group. A portion of the experimental design is summarised in the table below.

            Male   Female   Total
Drug A        82       68     150
Rituximab     85       65     150

Determine if the statement below is true or false: “The table shows that random assignment was not done, because the two groups (drug A and Rituximab) do not have the same number of females.”

The statement is false. Random assignment does not guarantee that the number of females will be exactly the same in both groups (drug A and Rituximab). If random assignment is done on a large number of subjects, the two groups will tend to be similar (not necessarily the same) in all aspects.

2. In a study on the relationship between watching television and obesity, 3000 individuals were recruited via an advertisement put up by ABC newspaper in Singapore. Afterwards, the investigator of the study separated the participants into two groups. One group consists of people who, on average, watch television for at least 4 hours a day, while the other group consists of people who, on average, watch television for strictly less than 4 hours a day. The investigator then records how many participants are obese in each group. Which of the following is/are true?

(I) The result of this study is generalisable to the population who reads ABC newspaper.
(II) This is an observational study.

(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (B). The result is not generalisable because the investigator uses non-probability sampling (volunteer sampling).
The study is clearly an observational study because the researchers did not assign the subjects to either group; the subjects were in effect self-assigned by their own television-watching habits.

3. In a large-scale experiment, a researcher randomly assigned 6000 subjects to receive either a drug or a placebo. 4000 patients were assigned to receive the drug, and the other 2000 patients received the placebo. The researcher did a quick headcount in the drug-receiving group and noted that there were 3002 males who received the drug. The researcher does not have time to do a headcount in the placebo group. Which is the most reasonable number of males to be expected in the placebo group?

(A) 1000.
(B) 1500.
(C) 2000.
(D) 2998.

Answer is (B). Randomised assignment of a large number of subjects tends to produce groups which are similar in all aspects (including the proportion of males in each group). 6000 is a reasonably large number, so we would expect the proportion of males and females in each group to be similar. Among the 4000 subjects who received the drug, 3002 (about 75%) were males. Hence among the 2000 patients who received the placebo, about 75% (1500) of them should be male.

4. Virus X has been known to cause very severe symptoms in its patients. Previously there has been no anti-viral medicine to treat virus X. Recently, researchers have finally managed to produce a trial drug in the form of a tablet. Researchers want to investigate if the trial drug helps to reduce the duration of symptoms (number of days) in patients. 1000 patients were sampled for the study, and all consented to join the study. Which of the following statements is/are true? Select all that apply.

(A) Random sampling should be done to ensure that the subjects’ demographics/characteristics are similar (in the treatment and control groups).
(B) Blinding the researchers to the subjects’ assigned groups (treatment or control group) is important because the researchers may have certain bias for/against the drug.
(C) If the study randomly assigns 400 subjects into the treatment group, and 600 subjects into the control group, the result of the study will be biased due to the unequal number in the two groups.

Only (B) is correct. Random assignment (not random sampling) should be done to ensure that the subjects’ demographics/characteristics are similar in the treatment and control groups. Furthermore, random assignment does not require the same number of subjects in both treatment and control groups. As long as the number of subjects is large, the treatment and control groups will likely have similar demographics/characteristics. In fact, since we are comparing rates and not numbers, it does not matter if we have unequal group sizes.

5. A researcher has invited 500 people to participate in his study. He uses random assignment to assign the subjects into the Treatment and Control groups. The Treatment group has 200 subjects, and the demographics of the 200 subjects are as follows:

          Old   Young
Male       83      32
Female     18      67

The 300 remaining subjects are in the Control group. The researcher should expect the number of young males in the Control group to be around

(A) 32.
(B) 48.
(C) 51.
(D) 173.

Answer is (B). When random assignment is conducted, we can expect the percentage of males to be similar in both Treatment and Control groups. We can also expect the percentage of young people to be similar in both Treatment and Control groups. The proportion of young males in the Treatment group is 32/200 = 0.16. We can expect a similar proportion in the Control group, implying that we expect about 0.16 × 300 = 48 young males in the Control group.

6. Xiao Lian is a frequent TikTok user, and she wants to compare the effect of watching TikTok dance videos versus watching maths videos on the happiness levels of all NUS students. She video calls 200 of her closest NUS friends to take part in her study, but only 100 of her friends responded.
Half of these 100 friends were chosen at random to watch a TikTok dance video while the other half were shown a maths video, and she asks them to rate their happiness levels before and after watching the videos. Which of the following statements is true?

(A) This is an observational study.
(B) This study is likely to contain selection bias.
(C) If all 200 of her friends responded, the study’s findings are generalisable to all NUS students.
(D) Random assignment was conducted, so the study does not contain selection bias.

Answer is (B). This study is likely to contain both selection bias and non-response bias. It is likely to have selection bias because Xiao Lian chose from only her friends, and not from the entire population of NUS students. For example, Xiao Lian might have more friends in her own faculty than in other faculties, or her friends might share her interests (TikTok). The study is likely to have non-response bias because 100 of them did not respond. Random assignment is a technique used to remove the effects of potential confounding variables. Random assignment will not remove selection bias in a study.

7. A study was conducted to investigate if memory foam pillows helped to improve one’s sleep quality. A large number of subjects were sampled via simple random sampling. The study aimed to randomly assign the subjects into the treatment and control groups. Thus, a fair coin was flipped for each subject to determine if the subject should be assigned into the treatment or control group: if “heads”, the subject is assigned into the treatment group; if “tails”, the subject is assigned into the control group. Subjects in the treatment group received a memory foam pillow, while subjects in the control group received a regular pillow. The table below describes the number of subjects in the two groups.

                  Sex
            Male   Female   Total
Treatment      q        t       w
Control        r        u       x
Total          s        v       y

From the above table, which of the following statements is true?
(A) q/w should be approximately equal to r/x.
(B) q must be equal to r.
(C) s should be approximately equal to v.
(D) w must be equal to x.

Answer is (A) because we expect that if random assignment was done and the coin is fair, we would see a similar proportion (may not be the exact proportion) of males in both the treatment and control groups. None of the other statements results from random assignment using a fair coin or from simple random sampling.

8. Which of the following statements is true regarding observational studies and experimental studies?

(A) Observational studies involve manipulating variables to establish cause-and-effect relationships.
(B) Experimental studies are only used to study rare events or long-term trends.
(C) Observational studies rely on random assignment to groups to control for confounding variables.
(D) Observational studies can establish associations.

Answer is (D). We are not able to manipulate variables in observational studies. The use of experimental studies is not restricted to the study of rare events or long-term trends. Random assignment is not feasible in observational studies.

9. Patch Z is a new medicine created to remove muscle soreness. A study was done to investigate the effectiveness of patch Z. The population of interest was Singaporean adult males. For this study, the researchers requested the Singapore Sports Association to sample all male athletes who reported for training over the week. 200 male athletes were sampled. There was no non-response. The 200 subjects had their identity tags randomly shuffled in a box. The first 100 tags picked from the box were assigned to the treatment group, which was administered patch Z. The remaining 100 were administered a placebo. 72% of the group that received patch Z had their muscle soreness alleviated, while 34% of the other group had their soreness alleviated. Which of the following is/are true?

(I) We are not able to generalise these results to the population of interest.
(II) Random assignment was not conducted.

(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (A). The result cannot be generalised, because of the sampling method and the fact that the sampling frame does not cover the population of interest. The method is non-probability sampling, and the sampling frame contains only male athletes who went for training that week, but the population of interest is all Singaporean adult males. Take note that random assignment was actually conducted.

10. A researcher is trying to study the happiness level of all current NUS students. Which of the following is/are (an) example(s) of probability sampling methods?

(I) The researcher gets a list of all current NUS students’ emails from the administrative office and randomly selects 100 students’ emails. He then emails them a link to a short e-survey. 50 students replied to his survey.
(II) The researcher invites all final year NUS students in his faculty to visit his lab for a psychological test to determine their happiness level, with a promise to compensate them for their time with a $10 voucher. 200 students turned up for the test.

(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (A). Randomly selecting 100 emails from the list of student emails is a probability sampling method, even if the response rate is not high. On the other hand, inviting only final year students in the researcher’s faculty is done out of convenience, and convenience sampling is a non-probability sampling method.

11. To find out the employment status of fresh graduates from University ABC, a questionnaire was sent to all of them. 30% of fresh graduates responded to the survey. The employment rate was calculated from the responses. Which of the following is likely to cause the calculated rate to differ from the population rate?

(I) Selection bias.
(II) Non-response bias.

(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (B). The 30% of fresh graduates who responded may not be representative of the whole population. It is possible that those who did not respond have a different employment status from those who responded. Although the response rate is low, it is still a census. There is no selection bias in a census.

12. Suppose I wish to find the average intelligence quotient (IQ) of all Primary 5 children studying in local schools in Singapore. I first selected a random sample of 10 schools out of all local primary schools in Singapore. Then I asked all the Primary 5 children in these chosen 10 schools to take an IQ test. Finally, I obtained the average value of all the IQ scores of children who took the test, which was 106. Which of the following statements is/are correct?

(I) The parameter in this study is the average IQ of all Primary 5 children who took the IQ test.
(II) Stratified sampling was employed in this study.

(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (D). The parameter in this study is the average IQ of all Primary 5 children studying in local schools in Singapore. 106 is a sample estimate of the actual parameter. Hence statement (I) is incorrect. In stratified sampling, the population is divided into groups (strata) and then we randomly obtain a sample from each group. In cluster sampling, the population is first divided into groups (clusters). Then we take a random selection of clusters from all clusters, and include all units in the chosen clusters to form our sample. Here, cluster sampling is employed, where each school is a cluster. Hence statement (II) is also incorrect.

13. An airline would like to find out the quality of service provided by their staff on one of its flights. The airline posted an interviewer right after the plane landed and told the interviewer to start her interview when the third passenger came out from the plane.
Thereafter, she would interview the next customer who came out from the plane each time she had finished interviewing the previous customer. What is the sampling method employed by the interviewer?

(A) Systematic sampling.
(B) Simple random sampling.
(C) Cluster sampling.
(D) None of the other given options.

Answer is (D). This sampling method is not a probability sampling method since there is no probability involved in selecting passengers to be interviewed. All the methods given in the other options are probability sampling methods.

14. In a drug factory, pills were manufactured in 1000 batches, with 20 units per batch, forming a total of 20000 units. You decide to sample some of these pills to ensure that the dosage is right. Suppose you randomly sample two batches and then select every unit in these batches to be in your sample. What sampling method did you use?

(A) Systematic sampling.
(B) Stratified sampling.
(C) Cluster sampling.
(D) Simple random sampling.

Answer is (C). Since every unit from the randomly selected batches (the clusters) is included in the sample, this is an example of the cluster sampling method.

15. Assume you have a sampling frame of your entire population of interest, which comprises 100 people’s names. Which of the following methods can be used to select a simple random sample of 10 people from this population?

(I) Assign, without replacement, to each person a random number from 1 to 100 such that none of them share the same number. Choose the people assigned numbers 1 to 10.
(II) Write the names on equal sized pieces of paper, put the papers in a hat. Shake the hat, mix the papers well and draw out 10 names.

(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (C). In simple random sampling, every unit in the population of interest must have the same chance of being selected in the sample. Both methods satisfy this.

16. Caleb resides in Stamford Hall, which comprises 400 university students.
He wishes to understand the association between Cumulative Average Point (CAP) and sleeping hours among students residing there. There are a total of 100 rooms, numbered 1 to 100, each having 4 resident students. Rooms 1 to 50 contain male residents, while rooms 51 to 100 contain female residents. Caleb picks all rooms that are multiples of 5 (i.e., rooms 5, 10, 15, ..., 100), and all students in the selected rooms were asked to do the study. Which of the following must be true?

(I) The above is an example of systematic sampling.
(II) The response rate for the study is 100%.

(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (D). It is not stated explicitly that the selection of the rooms is done via a randomised mechanism. Therefore, the sampling process is non-probability in nature. Also, it is not stated explicitly whether all who were asked to do the study responded. Therefore, the response rate for the study may not necessarily be 100%.

17. A researcher wants to know the average weight of year 1 students in University A. The researcher does not have access to such information, hence he decided to do a survey. All University A year 1 students have to take a compulsory module in the first semester of their studies, and hence have to be present for an in-person examination on 24th April at 1pm. The researcher stood outside the examination venue’s only exit with a weighing scale and waited for the examination to end. There were too many students for the researcher to weigh. Hence, to decide whom to weigh, while students were exiting, the researcher used a random integer generator to produce a random integer for each student. If the random integer was even, the researcher would measure the student’s weight. If the random integer was odd, the researcher would not measure the student’s weight. Assume that all students exited the venue in an orderly line and were compliant with the researcher.
There were 800 students exiting the venue, and the random integer generator produced 200 even numbers and thus 200 students were weighed. What issue(s) is the study likely to face? Select all that apply.

(A) Bias present due to the non-probability sampling method.
(B) Bias present due to the random integer generator producing unlikely random numbers.
(C) Bias present due to a poor sampling frame chosen.
(D) None of the other options.

Answer is (D). The study uses a form of probability sampling, where every unit has a non-zero and known probability of being selected. Also, the random integer generator is not always expected to produce 50% odd numbers and 50% even numbers, as this depends on its settings. The sampling frame matches the population of interest in this situation: the year 1 students.

18. A class of 150 men and 250 women is seated in an examination hall that has 25 rows of chairs, with 16 chairs in each row. The men are seated in chairs numbered 1 to 150, and the women are seated in chairs numbered 151 to 400. Which of the following scenarios will NOT produce a simple random sample of students from this class of 400? Select all that apply.

(A) Choose a row at random and select all the students in that row.
(B) Use a random number generator to generate 10 integers from 1 to 400. Select the students seated in chairs corresponding to these numbers.
(C) Use a random number generator to generate 10 integers from 1 to 150, and 10 integers from 151 to 400. Select the students seated in chairs corresponding to these numbers.
(D) Randomly select a letter from the English alphabet. Select for the sample students whose family names begin with that letter. If no family name begins with that letter, randomly choose another letter from the alphabet.

(A), (C) and (D) will not produce a simple random sample of students. Recall that a simple random sample (SRS) of size n is drawn in such a way that every possible sample of size n is equally likely to be chosen.
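As a rough illustration of the definition just recalled, the Python sketch below draws chair numbers in two ways: a single draw of 10 chairs from all 400, and the per-group draw described in option (C). The function names `srs_chairs` and `per_group_chairs` are hypothetical, and the sketch assumes the chair numbers are drawn without repeats, which the question does not spell out.

```python
import random


def srs_chairs(num_chairs=400, sample_size=10, seed=None):
    """Draw chair numbers so that every set of `sample_size` chairs is equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, num_chairs + 1), sample_size))


def per_group_chairs(seed=None):
    """Option (C): 10 chairs from 1-150 (men) and 10 chairs from 151-400 (women)."""
    rng = random.Random(seed)
    men = rng.sample(range(1, 151), 10)
    women = rng.sample(range(151, 401), 10)
    return sorted(men + women)


sample = srs_chairs(seed=1)
grouped = per_group_chairs(seed=1)

# Count how many of the per-group chairs belong to men (chairs 1 to 150).
men_in_grouped = sum(1 for chair in grouped if chair <= 150)

print(sample)
print(men_in_grouped)  # always 10, whatever the seed
```

The per-group draw always contains exactly 10 men, so some groups of 20 students (for example, 20 women) can never be selected, whereas the single draw places no such restriction on which chairs appear together.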
(A) does not produce an SRS, since not every group of size 16 has the same chance of being selected. For example, students from different rows have no chance of being selected together.
(B) produces an SRS, since every group of size 10 has the same chance of being selected.
(C) does not produce an SRS, since not every group of size 20 has the same chance of being selected. For example, a group of 20 women cannot all be in the sample. This is actually a stratified sample, with sex as strata.
(D) does not produce an SRS, since any two students with family names starting with different letters cannot both be in the sample.

19. Which of the following scenarios involve(s) the use of probability in the sampling process? Select all that apply.

(A) A student wants to know how receptive bus commuters in Singapore are to the recent announcement of a price increase. To obtain a sample, he went to a nearby bus interchange and looked for commuters wearing white clothing (his favourite colour). For each commuter wearing white, if the commuter was wearing spectacles, he approached the commuter. Otherwise, he did not approach the commuter. We may assume that every commuter that he approached completed the survey.
(B) An event organiser wants to survey a sample of the participants of his event to find out if they liked the activities he planned. He announced to all the participants that anyone who completed the survey would win a prize with a probability of 0.8. The sample was made up of 75 participants who responded to the survey.
(C) The principal of a primary school wants to select a sample of his primary one students to find out how they are coping with formal school education. There are 10 primary one classes and each class has 42 students. The principal went into each class and rolled a six-sided die. If the die showed k (k = 1, 2, 3, 4, 5 or 6), then students in the class with register numbers k, k + 6, k + 12, k + 18, k + 24, k + 30 and k + 36 would be included in the sample.
A final sample of 70 students, 7 from each class, was formed.
(D) A researcher is trying to study how much time students from the Faculty of Science sleep every day. He obtained a list of all Science students from the administration and for each student on the list, he generated a random number from 1 to 10. If the number generated was less than 7, the student was not selected. If the number was 7 or more, the student was selected, and the researcher sent an email to the student with a few survey questions. A total of 700 students received the survey email and 500 of them replied.

(C) and (D) are correct as both employ a randomised mechanism in the selection process. (B) is non-probability sampling (volunteer sampling), while (A) is also non-probability sampling as the student conveniently went to a nearby bus interchange (convenience sampling).

20. The Registry of Marriages is interested to see the relationship between the ages of husbands and wives in City X. They randomly sampled 1000 pairs of husbands and wives from the population of City X and obtained data of their ages (in years). Looking through the data, they found that men always marry women who are younger than them. Based only on the information given above, which of the following statements must be true?

(I) The average age of the husbands is more than the average age of the wives.
(II) The standard deviation of husband’s age is more than the standard deviation of wife’s age.

(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (A). We cannot tell anything about the spread of either the husbands’ ages or the wives’ ages just by knowing that men are marrying women younger than them. If all husbands are older than their wives, then it follows that the average age of the husbands is more than the average age of the wives.

21. A teacher has just finished marking the final examination scripts for her class of 50 Secondary 1 students.
She informs the students that the class average is 67.3. The maximum mark for the examination is 100 and the passing mark is 50. A student receives his examination script and realises his score is 65, which is lower than the average score. Based only on the information given above, which of the following statements must be true?

(I) The student has performed worse than half the class.
(II) Everyone in the class has passed the test.

(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (C). A score lower than the mean does not imply that the student has performed worse than half the class. Consider the following set of scores for 10 students: 45, 55, 60, 62, 64, 64, 65, 85, 86, 87. The average is 67.3. The student who has scored 65 clearly has not performed worse than half the class. Nor has everyone in the class passed the test, since one student has scored 45. A similar data set can also be constructed for 50 students. Therefore, neither statement must be true.

22. We have learnt that the standard deviation and interquartile range (IQR) are examples of summary statistics that help us to quantify the spread of data points. However, they are not the only ways of quantifying spread and there are other summary statistics that can also help us to do this. For a numerical variable x, we can define the Mean Absolute Deviation (commonly abbreviated as MAD) using the formula

\[ \text{Mean Absolute Deviation of } x = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n}, \]

where $x_1, x_2, \dots, x_n$ are the values for the variable in a data set, $\bar{x}$ is their mean, and n is the number of data points in the data set. The MAD is sometimes used in place of the standard deviation as a measure of the spread of the data. Based on the above formula, which properties must the MAD possess? Select all that apply.

(A) The MAD cannot take a negative value.
(B) The MAD does not change when a constant is added to all the data points.
(C) The MAD does not change when all the data points are multiplied by a constant.
(D) If the MAD is zero, then all the values of $x_1, x_2, \dots, x_n$ in the data set are the same.

(A), (B) and (D) are correct. Based on the formula above, the MAD behaves very similarly to the standard deviation. Since we are taking absolute values, the MAD can never be negative. If a constant is added to all the data points, then the constant is also added to the mean of the new data, therefore the absolute difference between each point and the mean remains the same. When we multiply all the data points by a constant, the mean is also multiplied by the same constant, therefore the difference between each point and the mean is also multiplied by the same constant. Hence, the MAD is multiplied by the absolute value of the constant. The constant can be a number other than 1 or −1, so the MAD can change. Finally, if the MAD of x is zero, it means

\[ \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n} = 0, \]

which means

\[ |x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}| = 0. \]

Since the absolute value of any number cannot be negative, we are adding numbers which cannot be negative to give us 0. Therefore, each number can only be zero, which means every data point is equal to the mean.

23. Suppose X is a numerical variable and the following are 10 data points for this variable:

4, 7, 4, 14, 10, 11, 17, 3, 8, r,

where r is a positive whole number that is unknown. Which of the following statements is/are always correct? Select all that apply.

(A) If the mean is greater than 8 then r must be greater than 2.
(B) If r is greater than 2, then the median must be greater than 8.
(C) The mean is always greater than the median regardless of the value of r.
(D) The mode is always greater than the median regardless of the value of r.

Only (A) is correct. The sum of the 10 data points is 4 + 7 + 4 + 14 + 10 + 11 + 17 + 3 + 8 + r = 78 + r.
If the mean is greater than 8, then 78 + r must be greater than 80, which implies r must be greater than 2. Arranging the 9 numbers excluding r in increasing order, we have 3, 4, 4, 7, 8, 10, 11, 14, 17. Recall that the median is the middle value of the sorted data; with 10 data points, it is the average of the 5th and 6th values. For example, if r = 3, the median is 7.5, which is not greater than 8, thus (B) is incorrect. From above, since r is at least 1, the mean is at least 7.9. But if r = 10 for example, then the mean is 8.8 while the median is 9, which is higher than the mean. So (C) is incorrect. (D) is also incorrect since if r = 4, the mode is 4 while the median is 7.5.

24. Consider a data set consisting of values for a numerical variable x. Let the values be $x_1, x_2, \dots, x_n$ arranged in ascending order. A value y is said to be the balancing point of x in the data set if the following condition is satisfied:

\[ (y - x_1) + (y - x_2) + \cdots + (y - x_k) = (x_{k+1} - y) + (x_{k+2} - y) + \cdots + (x_n - y), \]

where $x_1, x_2, \dots, x_k$ are the values of x in the data set that are smaller than or equal to y and $x_{k+1}, x_{k+2}, \dots, x_n$ are the values of x in the data set that are larger than y. For example, consider a small data set {1, 3, 5, 5, 5, 7, 9}. In this case the value 5 is the balancing point of the data set since

(5 − 5) + (5 − 5) + (5 − 5) + (5 − 3) + (5 − 1) = (7 − 5) + (9 − 5).

Which of the two statements below is/are true?

(I) The median of x is always the balancing point of x in any data set.
(II) The mode of x is always the balancing point of x in any data set.

(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (D). Neither the median nor the mode is always the balancing point of a data set. Consider a small data set {1, 1, 3, 4, 5}. The median of the data set is 3. However, we observe that (3 − 1) + (3 − 1) + (3 − 3) = 4 is not the same as (4 − 3) + (5 − 3) = 3. Similarly, the mode is 1 but once again we see that (1 − 1) + (1 − 1) = 0 is not the same as (3 − 1) + (4 − 1) + (5 − 1) = 9.

25. City planners wanted to know how many people lived in a typical housing unit so they compiled data from hundreds of forms that had been submitted in various city offices. Summary statistics are shown in the table below.

Mean                  2.53
Standard Deviation    1.4
Min                   1
Q1                    1
Median                2
Q3                    3
Max                   8

The city bases their garbage disposal fee on the occupancy level of the home or apartment. The annual fee is $50 plus $4 per person, so a single-occupant home pays $54 and homes with 10 people pay $50 + $4 × 10 = $90 a year. The median fee paid is (1) and the IQR of the fee paid is (2). Fill in the blanks for the statement above, giving your answers correct to 2 decimal places.

The answer to the first blank is 50 + 2 × 4 = 58 since the median number of occupants is 2. The IQR of the fee paid is (50 + 3 × 4) − (50 + 1 × 4) = 8.

26. An examination was given to Class A and Class B, which consisted of 20 students each. The minimum possible score for the examination is 0 and the maximum possible score is 100. The range of scores in Class A is from 75 to 95 marks. All the students in Class B scored less than or equal to 50 marks. Due to a shortage of teachers, Class A and Class B were combined to form Class C. Consider the following statements:

(I) The median score in Class C must be greater than the median score in Class B.
(II) The interquartile range of scores in Class C must be greater than the interquartile range of scores in Class A.

Which of the statements is/are true?

(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (C). Since Class A and Class B have the same number of students, and all students in Class A scored strictly greater than the maximum score of Class B, the median for Class C will be greater than the maximum score of Class B. Hence the median of Class C will be greater than the median of Class B. The largest possible interquartile range for Class A is 95 − 75 = 20. The smallest possible interquartile range for Class C is 75 − 50 = 25.
Hence the interquartile range for Class C must be greater than the interquartile range for Class A.

27. Which of the following statements is true?

(A) The 3rd quartile is a measure of central tendency.
(B) The 2nd quartile is a measure of central tendency.
(C) The 3rd quartile is a measure of dispersion.
(D) The 2nd quartile is a measure of dispersion.

Answer is (B). The 2nd quartile is the median, which is a measure of central tendency. The 3rd quartile is neither a measure of central tendency nor a measure of dispersion.

28. A multiple-choice mid-term examination was conducted for 2000 students in a General Education Module GEB1000. There were 20 questions. Students were awarded 1 mark for each correct answer and received 0 marks for any wrong answer. No partial credit was awarded for any question. A teaching assistant helped with the collation of the scores of the paper, and provided the following summary statistics:

Minimum = 2.0.
1st quartile = 7.5.
Median = 11.5.
Mean = 9.0.
Mode = 12.0.
3rd quartile = 13.2.
Maximum = 20.0.

Which of the following statements is/are true? Select all that apply.

(A) The 3rd quartile is incorrect.
(B) Based on the above information, we can conclude that the range is 18.0.
(C) Based on the above information, we can conclude that the coefficient of variation is 2.
(D) None of the other statements is true.

(A) and (B) are correct.

(A) is correct: since each score is a whole number between 0 and 20 inclusive, with no partial credit given for any question, it is not possible for the 3rd quartile to be 13.2. It is worth noting, though, that it is possible for the 1st quartile, median or 3rd quartile to be of the form x + 0.5, where x is an integer. For example, consider the following values for simplicity: 1, 2, 3, 4, 5, 7, 12, 13, 15. To derive the 3rd quartile, we look at the values above the median, that is, 7, 12, 13 and 15. The median of these four values is the average of 12 and 13, which is 12.5.
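The quartile calculation above can be checked with a short Python sketch. It assumes the "median of each half" convention used in the worked example, with the middle value left out of both halves when the number of data points is odd; the helper name `quartiles_by_halves` is hypothetical, and other quartile conventions exist that may give different values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-each-half convention.

    When the number of values is odd, the middle value is excluded
    from both the lower half and the upper half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)


q1, med, q3 = quartiles_by_halves([1, 2, 3, 4, 5, 7, 12, 13, 15])
print(q1, med, q3)  # prints: 2.5 5 12.5
```

Under this convention each quartile is either one of the data values or the average of two of them, so for whole-number scores it is a whole number or ends in .5; a value such as 13.2 cannot arise.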
(B) is correct as the range is the difference between the maximum and minimum values, which is 20.0 − 2.0 = 18.0.
(C) is incorrect. The coefficient of variation (CV) is the standard deviation divided by the mean. Since we do not have any information about the standard deviation, we are unable to determine the value of the CV.

29. There are 20 students in each of the groups A and B. All the students in both groups took a test. The minimum possible score is 0 and the maximum attainable score is 50. The range of scores obtained by the students in group A was 28 to 40. All the students in group B scored at least 42 marks. Let M be the average value of the combined scores of all the 40 students. Which of the following statements must be true?
(A) M is equal to the average score in group B.
(B) M is higher than the average score in group A.
(C) M is lower than the average score in group A.
(D) There is insufficient information to deduce the relationship between M and the average score in group A or group B.

Answer is (B). Since group A and group B have the same number of students, and all the students in group A scored strictly lower than the lowest score in group B, the average for group A is strictly lower than that for group B. Therefore, when the scores of both groups are combined, the overall average score will be the sum of the average score in group A and the average score in group B, divided by 2, which is strictly greater than the average score in group A and strictly lower than the average score in group B.

30. A study was done to determine the relationship between perceived obstetric violence and the risk of postpartum depression (PPD). A total of 782 women were asked to report on their baseline characteristics including the following: Age, Education level (Secondary School / Pre-university / University), Family monthly wage (Euros), and Nationality (Spanish / Portuguese / French / Others).
Which of the following statements is true about the types of variables of Age, Education level, Family monthly wage and Nationality, respectively?
(A) Numerical, Categorical ordinal, Numerical, Categorical ordinal
(B) Numerical, Categorical ordinal, Numerical, Categorical nominal
(C) Numerical, Categorical nominal, Numerical, Categorical ordinal
(D) Categorical ordinal, Categorical nominal, Categorical ordinal, Categorical nominal

Answer is (B). Education level and Nationality are categorical variables. There is a natural ordering for Education level but not for Nationality. Therefore, Education level is categorical ordinal and Nationality is categorical nominal. Age and Family monthly wage are numerical variables.

1. The figure below shows that out of every 9 Singaporeans, 1 of them has diabetes. Similarly, out of 10 Singaporeans over 60, 3 of them have diabetes. Let us define “over 60” as old and “60 and below” as young. Which of the following statements is/are true? Select all that apply.
(A) rate(Diabetes | Young) > rate(Diabetes | Old).
(B) rate(Young | Diabetes) < rate(Young | No diabetes).
(C) rate(Old | Diabetes) < rate(Old | No diabetes).
(D) rate(Diabetes | Young) < rate(Diabetes | Old).

(B) and (D) are correct. From the above, we can see that the overall rate of diabetes is 1/9 = 0.111. Additionally, we know that rate(Diabetes | Old) = 0.3. Using the basic rule on rates, we can deduce that rate(Diabetes | Young) must be less than 0.111 and, by extension, less than rate(Diabetes | Old). Since rate(Diabetes | Young) < rate(Diabetes | Old), the symmetry rule gives rate(Young | Diabetes) < rate(Young | No diabetes). Since rate(Diabetes | Old) > rate(Diabetes | Young), we can also conclude that rate(Old | Diabetes) > rate(Old | No diabetes).

2. In a survey conducted at a primary school, where 1000 students responded, 55% of the respondents indicated that they sleep more than 8 hours every day.
Among those who sleep more than 8 hours every day, 70% wear spectacles. We also know that 50% of the respondents wear spectacles. The percentage (computed to 3 significant figures) of respondents who wear spectacles among those who sleep at most 8 hours every day is ____.

By the basic rule on rates, since 50% of all respondents wear spectacles, and 70% among those who sleep more than 8 hours every day wear spectacles, the percentage among those who sleep at most 8 hours every day must be less than 50%. In fact, this percentage can be calculated exactly. Consider the following contingency table. Figures that can be derived from the question are inserted. The rest of the table can be completed in the order (1), (2), (3).

                         Sleep more than 8 hours/day   Sleep at most 8 hours/day   Column total
Wear spectacles          0.7 ∗ 550 = 385               (2) 500 − 385 = 115         500
Do not wear spectacles   (1) 550 − 385 = 165           (3) 450 − 115 = 335         500
Row total                550                           450                         1000

Among those who sleep at most 8 hours every day, the percentage of respondents who wear spectacles is 115/450 ∗ 100% = 25.6%.

3. The contingency table below shows the income level for a group of adults classified by gender.

                    Income level
Gender     Low    Middle    High    Total
Male       23     30        29      82
Female     26     28        20      74
Total      49     58        49      156

The marginal rate, rate(Middle income), is calculated to be (1) ____%; while the joint rate, rate(Female and Low income), is calculated to be (2) ____%. Give each answer as a percentage correct to 2 decimal places.

To calculate the marginal rate, rate(Middle income), we take the column total of all Middle income persons divided by the grand total of everyone in the sample, i.e. 58/156 ≈ 37.18%. Thus the answer to (1) is 37.18. Then, to calculate the joint rate, rate(Female and Low income), we take the count of “Females with Low income” divided, once again, by the grand total of everyone in the sample, i.e. 26/156 ≈ 16.67%. Thus the answer to (2) is 16.67.

4.
An experiment was done to evaluate the survival outcomes of a new drug X, against the current standard Imatinib, for treating Chronic Myeloid Leukemia (CML). 459 CML patients were assigned to two groups. One group was put on treatment X and the other on Imatinib treatment, over a course of 36 months. A portion of the experiment results is summarised in the table below.

           X                     Imatinib
           Patients   Deaths     Patients   Deaths
Male       100        6          104        12
Female     125        17         130        15
Total      225        23         234        27

Which of the following statements is/are true? Select all that apply.
(A) The assignment of patients to the two groups could not have been random because the two groups do not have the same number of patients.
(B) Being male is positively associated with survival from CML.
(C) Treatment X (compared to Imatinib) is positively associated with survival from CML.
(D) Sex is a confounder in the relationship between treatment type and survival from CML.

(B) and (C) are correct. Random assignment does not guarantee that the number of patients will be exactly the same in the different treatment groups.

rate(Survival | Male) = (100 + 104 − 12 − 6)/(100 + 104) = 186/204 = 0.91.
rate(Survival | Female) = (125 + 130 − 17 − 15)/(125 + 130) = 223/255 = 0.87.
Since rate(Survival | Male) is greater than rate(Survival | Female), being male is positively associated with survival.

rate(Survival | X) = (225 − 23)/225 = 202/225 = 0.90.
rate(Survival | Imatinib) = (234 − 27)/234 = 207/234 = 0.88.
Since rate(Survival | X) is greater than rate(Survival | Imatinib), treatment X is positively associated with survival.

rate(Male | X) = 100/225 = 4/9 = 104/234 = rate(Male | Imatinib).
So sex and treatment type are not associated. As sex is not associated with treatment type, it cannot be a confounder in the relationship between treatment type and survival. Recall that a confounder must be associated with both the independent and dependent variables.

5. Consider the following figure, which depicts the proportion of working husbands and wives in Singapore.
Suppose that couples with BOTH husband and wife working are considered as “Dual-career” whereas all other scenarios are considered as “Non-dual-career”. Which of the following statements is/are true amongst married couples in the years 2010 and 2015? Select all that apply.
(A) The year 2015 is positively associated with dual-career couples.
(B) The year 2015 is positively associated with non-dual-career couples.
(C) rate(Dual-career | 2015) > rate(Dual-career | 2010).
(D) rate(Non-dual-career | 2010) > rate(Non-dual-career | 2015).

(A), (C) and (D) are correct. Note that rate(Dual-career | 2015) = 53.8% > 47.1% = rate(Dual-career | 2010). So the year 2015 is positively associated with dual-career couples. This also implies that the year 2015 cannot be positively associated with non-dual-career couples, as an event B cannot be positively associated with both an event A and its complement, not A. Finally, rate(Non-dual-career | 2015) = 27.7 + 6 + 12.4 = 46.1% < 52.9% = rate(Non-dual-career | 2010).

6. In this question, we focus only on the final year students in AMAS university in a certain year. You are given that 75% of all these students graduated, and that 60% graduated among the female students. Which of the following statements must be true? Select all that are true.
(A) The percentage of graduates among the female students is less than the percentage of graduates among the male students.
(B) The percentage of female students among the non-graduates is greater than the percentage of female students among the graduates.
(C) The percentage of male students among the graduates is less than the percentage of male students among the non-graduates.
(D) The percentage of male students is greater than the percentage of male students among the non-graduates.

(A), (B) and (D) are correct. Let G represent graduates, NG represent non-graduates, M represent males and F represent females. Then we are given that rate(G) = 75% and rate(G | F) = 60%.
By the basic rule on rates, since rate(G) > rate(G | F), we must have rate(G | M) > rate(G) > rate(G | F). Hence the percentage of graduates among female students is less than the percentage of graduates among male students. By the symmetry rule, we have rate(F | G) < rate(F | NG), and so the percentage of female students among the non-graduates is greater than the percentage of female students among the graduates. By the symmetry rule, we also have rate(M | G) > rate(M | NG), and so it is incorrect to say that the percentage of male students among the graduates is less than the percentage of male students among the non-graduates. Finally, by the basic rule on rates, we must have rate(M | G) > rate(M) > rate(M | NG). Thus the percentage of male students is greater than the percentage of male students among the non-graduates.

7. Joseph conducted a study on night owls (individuals who sleep after 12am on average every night) among staff and students in NUS. He gathered the following result on rates:
rate(Night owl | Student) = 0.7;
rate(Non-Night owl | Staff) = 0.4.
Which of the following statements is correct?
(A) rate(Night owl) must be between 0.4 and 0.7.
(B) rate(Night owl) must be between 0.6 and 0.7.
(C) rate(Non-Night owl) must be between 0.4 and 0.7.
(D) rate(Non-Night owl) must be between 0.6 and 0.7.

Answer is (B). From rate(Non-Night owl | Staff) = 0.4, we obtain rate(Night owl | Staff) = 1 − 0.4 = 0.6. Since rate(Night owl | Student) = 0.7, applying the basic rule on rates, the overall rate(Night owl) must lie between the rates in these two subgroups. Thus rate(Night owl) must be between 0.6 and 0.7. In addition, from rate(Night owl | Student) = 0.7, we obtain rate(Non-Night owl | Student) = 1 − 0.7 = 0.3. Applying the basic rule on rates, the overall rate(Non-Night owl) must be between 0.3 and 0.4.

8. The relative frequency table below shows the distribution of annual total personal income for the entire population of 6,402,386 living in City X.
It is known that there are 59% males and 41% females in City X.

Income                 Percent
$9,999 or less         2.2%
$10,000 to $14,999     4.7%
$15,000 to $24,999     15.8%
$25,000 to $34,999     18.3%
$35,000 to $49,999     21.2%
$50,000 to $64,999     13.9%
$65,000 to $74,999     5.8%
$75,000 to $99,999     8.4%
$100,000 or more       9.7%

We are told that in City X, 71.8% of females earn less than $50,000 per year. Which of the following statements is correct?
(A) There is positive association between being male and earning less than $50,000.
(B) There is negative association between being male and earning less than $50,000.
(C) There is no association between being male and earning less than $50,000.
(D) We do not have sufficient information to determine the correctness of the other three statements.

Answer is (B). We need to figure out the relationship between rate(Less than 50,000 | Males) and rate(Less than 50,000 | Females). The latter is given in the question as 71.8%. From the table, one can calculate the overall rate(Less than 50,000) to be 2.2 + 4.7 + 15.8 + 18.3 + 21.2 = 62.2%. So among everyone, the rate of those earning less than 50,000 is 62.2%, while for females it is higher at 71.8%. By the basic rule on rates, the rate among males must be less than 62.2% and hence less than 71.8%. Thus rate(Less than 50,000 | Males) < rate(Less than 50,000 | Females), which implies that the association between being male and earning less than 50,000 is negative.

9. Consider the following statements.
(I) A spokesman was quoted as saying that the proportion of a certain ethnic group among those who contracted Disease X was lower than the proportion of that ethnic group in the general population.
(II) A reporter interpreted the statement and concluded that the members of this ethnic group are less likely to contract Disease X than a random member of the population.
Which of the following is correct?
(A) The two statements are equivalent.
(B) We can infer statement (I) from (II) but not the other way round.
(C) We can infer statement (II) from (I) but not the other way round.
(D) We can neither infer statement (I) from (II), nor infer statement (II) from (I).

Answer is (A). Let M represent the particular ethnic group (NM represent not the ethnic group) and C represent contracting Disease X (NC represent not contracting Disease X). Statement (I) says that rate(M | C) < rate(M). By the basic rule on rates, rate(M | C) < rate(M) < rate(M | NC). By the symmetry rule, this implies rate(C | M) < rate(C) < rate(C | NM). In particular, rate(C | M) < rate(C), which is statement (II). Each step is reversible, so statement (I) can likewise be inferred from statement (II).

10. A school has some students, who are categorised as follows:

          In a sports club   Not in a sports club
Male      20                 30
Female    20                 80

Which statement below is correct?
(A) There is no association between being in a sports club and sex, since the number of male students in a sports club is the same as the number of female students in a sports club.
(B) There is an association between being in a sports club and sex, since the number of male students in a sports club is the same as the number of female students in a sports club.
(C) There is no association between being in a sports club and sex, since the rate of sports club students among males is different from the rate of sports club students among females.
(D) There is an association between being in a sports club and sex, since the rate of sports club students among males is different from the rate of sports club students among females.

Answer is (D). To see if there is an association, we compare the rates of sports club students among male and female students.
rate(sports club students | male) = 20/50 = 0.4;
rate(sports club students | female) = 20/100 = 0.2.
Since the rates are different, there is an association between being in a sports club and sex.

11. Coriander is a common herb used to add flavour to various kinds of dishes. Suppose we have two countries: Country X and Country Y.
We have individuals who either dislike coriander or like coriander, and are either male or female. We know that in country X,
rate(Dislike) = 0.1.
rate(Dislike | Male) = 0.3.
Also, in country Y,
rate(Female) = 0.8.
rate(Female | Dislike) = 0.4.
Which of the following statements must be true? Select all that apply.
(A) rate(Male) < rate(Female) in country X.
(B) There is a positive association between disliking coriander and being female in country X and country Y separately.
(C) Overall rate(Male) is between 0.2 and 0.5, when both countries are combined.

Only (A) is correct. The Basic Rule on Rates states that the overall rate(A) always lies between rate(A | B) and rate(A | not B). As such, in country X, we note that rate(Dislike | Female) has to be between 0 and 0.1, since rate(Dislike) = 0.1 and rate(Dislike | Male) = 0.3. In any case, rate(Dislike) is closer to rate(Dislike | Female) than it is to rate(Dislike | Male). Hence, rate(Male) < rate(Female) in country X is correct.

In country X, rate(Dislike | Male) > rate(Dislike | Female). Therefore, there is a negative association between disliking coriander and being female in country X. In a similar way, using the Basic Rule on Rates, we note that in country Y, rate(Female | Dislike) < rate(Female) < rate(Female | Not Dislike). Therefore, there is also a negative association between disliking coriander and being female in country Y, and statement (B) is incorrect.

Finally, in country Y, rate(Male) = 1 − rate(Female) = 0.2. On the other hand, we noted previously that in country X females form the majority, as rate(Female) > rate(Male). Hence rate(Male) in country X can be smaller than 0.2, in which case the overall rate(Male) lies between 0 and 0.2. Statement (C) therefore need not hold and is incorrect.

12. A researcher is interested to find out if the residential location of a person is associated with whether the person is a night owl (only sleeps after midnight). The variables of interest are:
Location (East/West of Singapore).
Sleep Pattern (Night Owl / Not Night Owl).
The researcher surveyed 1000 people in Singapore and the following is a table showing the results from the researcher’s study.

         Night Owl   Not Night Owl
East     120         450
West     330         100

From the table’s results, there is an association between ‘Location’ and ‘Sleep Pattern’ because
(A) rate(Night Owl | East) < rate(Night Owl | West).
(B) rate(East | Night Owl) < rate(West | Night Owl).
(C) rate(Night Owl | West) < rate(Night Owl | East).
(D) rate(Night Owl | East) < rate(East | Night Owl).

Answer is (A). To check for association, we compare rate(A | B) with rate(A | NB). Here rate(Night Owl | East) = 120/570 and rate(Night Owl | West) = 330/430. Since rate(Night Owl | East) < rate(Night Owl | West), there is an association present between ‘Location’ and ‘Sleep Pattern’.

13. Jolie is interested to find out if there is any association between heights of individuals and reading habits. She collected a sample of 60 individuals, all with unique heights. Individuals whose height is equal to or greater than the median height among the 60 individuals are considered as “Tall”, while individuals whose height is lower than the median height are considered as “Not Tall”. Individuals who complete at least 2 books a week are classified as “Frequent Reader”, while individuals who read less than 2 books a week are classified as “Not Frequent Reader”. She then constructed a 2 by 2 contingency table as follows:

                       Tall   Not Tall   Row Total
Frequent Reader        12                21
Not Frequent Reader
Column Total

Unfortunately, after constructing the above table on a piece of paper, she spilled coffee on that paper, and she was only able to recover the above figures. What conclusion could possibly be deduced from the above?
(A) It is impossible to deduce if “Frequent Reader” is associated with being “Tall” as more information is needed.
(B) “Frequent Reader” is positively associated with being “Tall”.
(C) “Frequent Reader” is not associated with being “Tall”.
(D) “Frequent Reader” is negatively associated with being “Tall”.

Answer is (B). Consider the fact that a sample of 60 individuals was selected. Since all individuals have unique heights, the median height would be between the 30th and 31st individual when the individuals are lined up in ascending order of height. Hence, the number of “Tall” people in total would be 30, while the number of “Not Tall” people would be 30. Consequently, the number of “Tall” people who are “Not Frequent Reader” would be 18, and the number of “Not Tall” and “Frequent Reader” would be 9. As such, rate(Frequent Reader | Tall) = 12/30 = 40% while rate(Frequent Reader | Not Tall) = 9/30 = 30%. Since rate(Frequent Reader | Tall) is greater than rate(Frequent Reader | Not Tall), there is a positive association between being “Frequent Reader” and “Tall”.

14. Consider the following contingency table that shows the smoking status and outcome after 20 years (heart disease or not) for 270 people.

                    Smoker   Not Smoker   Column Total
Heart Disease       60       90           150
No Heart Disease    30       90           120
Row Total           90       180          270

The values of rate(Heart Disease) and rate(Smoker and No Heart Disease) are (1) ____ and (2) ____, respectively. Give your answers as percentages correct to one decimal place.

Rate(Heart Disease) = 150/270 = 5/9 = 55.6%.
Rate(Smoker and No Heart Disease) = 30/270 = 1/9 = 11.1%.

15. The table below shows the income level (Low, Middle, High) of 100 adults classified by gender.

             Male   Female   Column Total
Low          10     15       25
Middle       30     20       50
High         15     10       25
Row Total    55     45       100

The proportion of middle-income earners among males is ____. Give your answer as a percentage correct to one decimal place.

Proportion of middle-income earners amongst males refers to rate(Middle | Male). This means we take the number of middle-income males, divided by the total number of males. Thus rate(Middle | Male) = 30/(10 + 30 + 15) = 30/55 = 54.5%.

16.
The employees of a particular company were surveyed on the average number of hours they spend sitting down each day (between 1 and 6 hours, or more than 6 hours) and whether their Body Mass Index classifies them as being obese. The graph below summarises the results of the survey. Based only on the information from this graph, determine which of the following statements must be correct. For ease of notation, we will identify those who spend 1 to 6 hours sitting down as Group A and those who spend more than 6 hours sitting down as Group B.
(I) Out of all the responses, there are more obese employees in group B than in group A.
(II) It is more likely that a surveyed employee is from group B if he/she is not obese as compared to if he/she is obese.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (D). Statement (I) may not be correct. Rates are not the same as absolute numbers. A higher rate of obese people in group B as compared to that in group A does not mean that there are more obese employees in group B than in group A. Statement (II) is not correct. Since we have rate(Obese | Group A) < rate(Obese | Group B), by the symmetry rule, we will have rate(Group A | Obese) < rate(Group A | Not obese). Equivalently, rate(Group B | Obese) > rate(Group B | Not obese), which is the opposite of statement (II).

17. A professor wanted to know more about media consumption habits among the students in his class. He took note of each student’s gender (“Male” or “Female”) and asked the student which communication platform they used most often (either “Telegram” or “Whatsapp”). It was found that rate(Male | Telegram) > rate(Male | Whatsapp). Which of the following statements must be true?
(I) rate(Telegram | Male) > rate(Telegram | Female)
(II) rate(Female | Telegram) < rate(Female | Whatsapp)
(III) rate(Whatsapp | Female) < rate(Whatsapp | Male)
(A) Only (I).
(B) Only (I) and (II).
(C) All three statements are true.
(D) None of the statements is true.

Answer is (B).
Since we know from the question that Males and Telegram are positively associated, by the symmetry rule, Statements (I) and (II) are true, while Statement (III) is false.

18. The Singapore Department of Statistics publishes yearly figures for the proportion of citizen singles in the population, across various subgroups like gender and age group. The following is an abridged list of proportions of singlehood for the year 2022, across their respective subgroups, obtained from the online database (https://tablebuilder.singstat.gov.sg/table/TS/M810691):

Subgroup                   Proportion of Singles (in %)
30-39 Years (Total)        31.2
30-39 Years (Males)        34.4
30-39 Years (Females)      27.9
40-49 Years (Total)        17.4
40-49 Years (Males)        16.6
40-49 Years (Females)      18.0

Let us consider the age group of 30-39 Years as people in their “thirties”, and the age group of 40-49 Years as people in their “forties”. Based on the available information above, which of the following statements is/are necessarily true? Select all that apply.
(A) Among those in their thirties, rate(Males) > rate(Females).
(B) Among those in their forties, rate(Males) < rate(Females).
(C) When both age groups are combined, rate(Singles | Males) > rate(Singles | Females).
(D) When both age groups are combined, rate(Singles | Males) < rate(Singles | Females).
(E) Gender is a confounder in the association between singlehood and age group.

(A) and (B) are correct. Among those in their thirties, rate(Singles) = 31.2% is closer to rate(Singles | Males) = 34.4% than to rate(Singles | Females) = 27.9%, therefore (A) is correct. Among those in their forties, rate(Singles) = 17.4% is closer to rate(Singles | Females) = 18.0% than to rate(Singles | Males) = 16.6%, therefore (B) is correct. When both age groups are combined, using the Basic Rule on Rates, 16.6% < rate(Singles | Males) < 34.4%; while 18.0% < rate(Singles | Females) < 27.9%.
As such, it is possible for rate(Singles | Males) = rate(Singles | Females), therefore (C) and (D) are not necessarily true. By similar reasoning, (E) is also not necessarily true because gender is not necessarily associated with singlehood and therefore it is not necessarily a confounder.

19. The following infographic was extracted from the Department of Statistics Singapore, depicting the usual mode of transport amongst resident students travelling to school in 2020. Let us classify a student whose usual mode of transport to school is “Public Bus Only”, “MRT/LRT and Public Bus Only”, “MRT/LRT Only” or “Other combinations of MRT/LRT or Public Bus” as a student who takes public transport to school. If the proportion of female resident students travelling to school who take public transport and the proportion of male resident students travelling to school who take public transport are the same, what is the proportion of students who take public transport to school, out of the male resident students who travel to school? Give your answer as a percentage correct to one decimal place.

Answer is 54.6%. From the infographic given, we see that the proportion of resident students who take public transport to school is 22.0% + 7.0% + 23.0% + 2.6% = 54.6%. Furthermore, since the proportion of female resident students travelling to school who take public transport and the proportion of male resident students travelling to school who take public transport are the same, by the third consequence of the Basic Rule on Rates, the proportion of students who take public transport to school, out of all the male resident students who travel to school, is 54.6%.

20. The following line graph taken from the Ministry of Health website shows the median wait times for admission at the Emergency Medicine Departments at five hospital locations (CGH, NTFGH, SGH, SKH, TTSH) in Singapore across seven days in October 2023.
We consider a hospital to have a long wait time on a particular day if the median wait time for admission is greater than 10 hours. Fridays, Saturdays and Sundays are considered weekends, while the other days are considered weekdays. Based only on the line graph, which of the following statements is/are necessarily true? Select all that apply.
(A) Weekdays are positively associated with long wait times at NTFGH.
(B) Weekends are negatively associated with long wait times at CGH.
(C) There is no association between weekdays and long wait times at SKH.
(D) Simpson’s Paradox is observed in the association between wait times and whether a day is a weekday or a weekend.

(A) and (C) are true. We may form the following table from the line graph.

         Weekday                            Weekend
         CGH   NTFGH   SGH   SKH   TTSH     CGH   NTFGH   SGH   SKH   TTSH
Long     1     3       1     0     2        1     0       0     0     0
Short    3     1       3     4     2        2     3       3     3     3

At NTFGH, r(Long | Weekday) = 3/4 > 0/3 = r(Long | Weekend), so weekdays are positively associated with long wait times. (A) is correct.
At CGH, r(Long | Weekday) = 1/4 < 1/3 = r(Long | Weekend), so weekdays are negatively associated with long wait times. Hence (B) is incorrect.
At SKH, r(Long | Weekday) = 0/4 = 0/3 = r(Long | Weekend), so there is no association between weekdays and long wait times. Thus (C) is correct.
At SGH, r(Long | Weekday) = 1/4 > 0/3 = r(Long | Weekend), so weekdays are positively associated with long wait times.
At TTSH, r(Long | Weekday) = 2/4 > 0/3 = r(Long | Weekend), so weekdays are positively associated with long wait times.
Combining all the subgroups, r(Long | Weekday) = 7/20 > 1/15 = r(Long | Weekend), so the overall association between weekdays and long wait times is positive. Since there is a positive association between weekdays and long wait times in three of the five subgroups, Simpson’s Paradox is not observed, hence (D) is incorrect.

21. The rate of lung cancer among females in Singapore is 40%, while the rate of lung cancer among males in Singapore is also 40%.
Researchers also discovered that the rate of lung cancer among smokers in Singapore is 70%. Which of the following statements is/are true?
(I) Sex is a confounder when discussing the relationship between smoking and lung cancer.
(II) Lung cancer is positively associated with smoking.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (B). We know that rate(Lung cancer | Females) is 40%, which is equal to rate(Lung cancer | Males). Hence, lung cancer and sex are not associated, which means that sex cannot be a confounder when looking at the relationship between smoking and lung cancer. By the basic rule on rates, we know that rate(Lung cancer) = 40%. Given that rate(Lung cancer | Smokers) = 70%, we know that rate(Lung cancer | Non-smokers) must be less than 40%. Since rate(Lung cancer | Smokers) is greater than rate(Lung cancer | Non-smokers), we know that lung cancer is positively associated with smoking.

22. A researcher is interested in finding out if gender affects whether a student is left-handed. The information on gender (Male/Female), master hand (Left/Right) and IQ (Above average / Below or at average) of all students was collected. The data was used to plot the figure below. The researcher makes the following two statements.
(I) IQ can be a confounder when investigating the relationship between gender and master hand.
(II) When finding out if IQ affects whether a person is left-handed, it is possible that Simpson’s Paradox is observed in the data due to the third variable, gender.
Which of the following correctly describes the two statements above?
(A) Statement (I) is true but statement (II) is false.
(B) Statement (I) is false but statement (II) is true.
(C) Both statements are true.
(D) Both statements are false.

Answer is (D). From the graph, we see that rate(Male | Above average IQ) = rate(Male | Below or at average IQ) = 0.6.
Alternatively, from the graph, we see that the ratio of male to female in the “Above average” category is the same as in the “Below or at average” category. Hence, IQ is not associated with gender. Since IQ is not associated with gender, it cannot be a confounder when investigating the relationship between gender and master hand. Thus statement (I) is false. Since gender is not associated with IQ, it cannot be a confounder when studying the relationship between IQ and master hand. Since it is not even a confounder in the first place, Simpson’s Paradox cannot be observed. Thus, statement (II) is false.

23. A researcher conducted an observational study to investigate whether smoking is associated with heart disease. 1000 participants were recruited in the study and the researcher obtained the following result:

                    Smoker   Non-smoker
Heart disease       146      324
No heart disease    317      213

He also observed that 80% of the smokers were alcoholics, while 85% of the people with heart disease were alcoholics. Consider the following statements:
(I) There is positive association between smoking and heart disease.
(II) Being an alcoholic is a confounder.
Based only on the information given above, which of the two statements must be true?
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (C). To conclude if being an alcoholic is a confounder, we need information on the percentage of non-smokers who are alcoholics and the percentage of people who are alcoholics among those who do not have heart disease. From the table, rate(Smokers | Heart disease) = 146/470 = 0.31 and rate(Smokers | No heart disease) = 317/(317 + 213) = 0.60. Thus, in this example, smoking is negatively associated with heart disease, since the rate of smokers is lower in the heart disease group as compared to the group with no heart disease.

24. Consider a study that intends to examine whether the colour red makes children act impulsively.
A group of 500 children were assigned into two groups by the expert opinion of a child psychologist; group Red if the psychologist pointed to the child, and group Green if the psychologist did not. Each child is then led into a room that has a big button in the colour of their group and labelled “DO NOT PRESS ME!”. It is then recorded whether the child presses the button within 10 minutes. All the children were each then given a candy for participating. The children were also asked if they like candies. The following table summarises the data. For instance, 22 children from group Red that pressed the button do not like candies.

                        Like candies       Does not like candies
                        Red     Green      Red     Green
Pressed button            3      135        22        1
Did not press button    177       60        38       64

Is liking candy a confounder in this study?
(A) Yes.
(B) No.
(C) There is insufficient information given to determine whether liking candy is a confounder in this study.

Answer is (B). We have to check whether liking candy is associated with both the colour of the group and pressing the button.
rate(Like candies | Red) = (3 + 177)/(3 + 177 + 22 + 38) = 0.75.
rate(Like candies | Green) = (135 + 60)/(135 + 60 + 1 + 64) = 0.75.
Since the two conditional rates are the same, liking candy is not associated with the colour of the group, and is hence not a confounder in this study. Another way to get to this conclusion is to show that rate(Red | Like candies) = rate(Red | Does not like candies) = 0.48.

25. To investigate whether controlling mothers tend to have obese children, a study was conducted and it was found that controlling behaviours in mothers is positively associated with obesity in their children:
rate(Obese child | Controlling mom) = 250/350 = 71%;
rate(Obese child | Non-controlling mom) = 138/350 = 39%.
It was suspected that the health status of the father plays a role in the phenomenon, so the sample was sliced according to whether the fathers were obese or not.

                          Obese father           Non-obese father       Overall
                          Number   Obese child   Number   Obese child   Number   Obese child
Non-controlling mother      87         78         263         60         350        138
Controlling mother         270        234          80         16         350        250

Which of the following statements is/are true?
(I) Obesity in the father is a confounder; furthermore, this is an example of Simpson’s Paradox being observed.
(II) Obesity in children is positively associated with controlling mothers when the father is obese.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (A). Statement (I) is correct. Obesity in the father is a confounder since
rate(Obese father | Non-controlling mom) = 87/350 = 24.9%;
rate(Obese father | Controlling mom) = 270/350 = 77.1%.
So obesity in father is positively associated with controlling mother. Furthermore,
rate(Obese child | Obese father) = (78 + 234)/357 = 87.4%;
rate(Obese child | Non-obese father) = (16 + 60)/343 = 22.2%.
So obesity in child is positively associated with obesity in father. In addition, Simpson’s Paradox is observed since for the obese father subgroup, obesity in child is negatively associated with controlling behaviour in mothers (87% vs 90%), and the same is observed in the non-obese father subgroup. The trend is reversed when the two subgroups are combined. Statement (II) is incorrect: as stated above, obesity in children is negatively associated with controlling mothers when the father is obese.

26. In a certain year, it is known that the prevalence of diabetes among Singapore residents is 10% and the prevalence of diabetes among old (age 60 and above) Singapore residents is 30%. It was suggested that sex is a possible confounder in the observed association between age and diabetes among Singapore residents. After further analysis, the researchers concluded that sex is not a confounder, and there is an association between sex and age. Which of the following statements is/are true? Select all that apply.
(A) rate(Diabetes | Male) = rate(Diabetes | Female).
(B) rate(Male | Diabetes) = rate(Female | Diabetes).
(C) rate(Diabetes | Female) = 10%.

(A) and (C) are correct. Sex is associated with age but is not a confounder in the association between age and diabetes, so sex cannot be associated with diabetes. Thus it is correct to say that rate(Diabetes | Male) is equal to rate(Diabetes | Female). Consequently, rate(Diabetes) = rate(Diabetes | Male) = rate(Diabetes | Female) = 10%. On the other hand, we do not know how many diabetic residents are male or female, so we cannot determine if rate(Male | Diabetes) is equal to rate(Female | Diabetes).

27. The graph below, depicting cases of COVID infection (titled “Local cases in the last 14 days by Age group and Vaccination status”), was published in a press release by Singapore’s Ministry of Health on 21 July 2021. Let us designate the age range of those 61 years or older as “seniors”, and all others as “non-seniors”. Let us also consider the status of either full or partial vaccination as “vaccinated”, and the status of being completely unvaccinated as “unvaccinated”. What can be concluded based on the information given? Select all statements that apply.
(A) The rate of infection among seniors is lower than the rate of infection among non-seniors.
(B) For cases depicted by the graphic, being vaccinated is positively associated with seniors.
(C) Age is a confounder in the association between infection and vaccination status.
(D) There were about twelve cases on average daily (correct to the nearest whole number) for those senior and vaccinated.

(B) and (D) are correct. Representing the information from the graph in the following 2-by-2 contingency table (with the stated conditions to form our categories) will make the calculations for the necessary rates much easier.

                Seniors   Non-seniors   Total
Vaccinated        167         479        646
Unvaccinated       22         187        209
Total             189         666        855

We are not able to compare the rate of infection among seniors with the rate of infection among non-seniors because we do not have figures for non-infected persons. Thus, we cannot compare rate(Infected | Seniors) vs. rate(Infected | Non-seniors). However, we can compare rate(Vaccinated | Seniors) vs. rate(Vaccinated | Non-seniors). Using the table: rate(Vaccinated | Seniors) = 167/189 = 88.36%, which is larger than rate(Vaccinated | Non-seniors) = 479/666 = 71.92%. So being vaccinated and being senior are indeed positively associated with each other. With reference to the title of the graph, we take the number of cases for those senior and vaccinated (which is 167) and divide it by 14 days. Since 167/14 = 11.9, it is true that there were about 12 cases on average daily (correct to the nearest whole number) for those senior and vaccinated. To establish that age is a confounder, we need to find that age is associated with both infection and vaccination status. However, as mentioned above, we have insufficient information to show that age and infection are associated with each other. So we cannot determine if age is a confounder in the association between infection and vaccination status.

28. A researcher is investigating the relationship between alcohol consumption and the risk of an individual contracting liver cancer. He conducts his investigation in a town consisting of 10000 adults. Every adult is classified as an Alcoholic or Non-alcoholic and similarly every adult’s risk of liver cancer is classified as High or Low. During his investigation he realises that rate(Male | Alcoholic) ≠ rate(Male) and rate(Female | High) < rate(Female). Based only on the information above, which of the following statements must be true?
(A) Being alcoholic is positively associated with being male.
(B) High risk of liver cancer is positively associated with being female.
(C) Sex is a confounder in this study.
(D) There is an association between the risk of liver cancer and alcohol consumption.

Answer is (C). Based on the information given in the question, we cannot determine whether rate(Male | Alcoholic) > rate(Male | Non-alcoholic), so we cannot conclude that being alcoholic is positively associated with being male. Option (B) is false since by the basic rule on rates, we will have rate(Female | High) < rate(Female) < rate(Female | Low), meaning that the association between high risk of liver cancer and being female is negative. Option (C) is true because by the basic rule on rates we must have either rate(Male | Alcoholic) > rate(Male) > rate(Male | Non-alcoholic), or rate(Male | Alcoholic) < rate(Male) < rate(Male | Non-alcoholic). So in either case we obtain that rate(Male | Alcoholic) ≠ rate(Male | Non-alcoholic). Therefore sex is associated with alcohol consumption. By the fact that there is negative association between high risk of liver cancer and being female, we see that sex is associated with the risk of liver cancer. Therefore sex is a confounder. We do not have any information to determine the association between alcohol consumption and liver cancer so we cannot tell whether there is an association between alcohol consumption and risk of liver cancer.

29. In a study investigating the association between having pets and depression, it is found that among males, rate(Depressed) = 0.4 and rate(Depressed | Having pets) = 0.5, whereas among females, rate(Depressed) = 0.3 and rate(Depressed | Having pets) = 0.6. Consider the following two statements:
(I) It is possible to observe Simpson’s Paradox in this study.
(II) Among females, rate(Having pets | Depressed) > rate(Having pets | Not depressed).
Which of the above two statements is/are true?
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (B).
Being depressed is positively associated with having pets in each gender subgroup due to the basic rule on rates, since rate(Depressed | Having pets) > rate(Depressed) implies that rate(Depressed) > rate(Depressed | Not having pets). Statement (I) is false because, in the combined group, rate(Depressed) must lie between 0.3 and 0.4 and rate(Depressed | Having pets) must lie between 0.5 and 0.6 by the basic rule on rates. Thus, being depressed is positively associated with having pets in the overall group and no Simpson’s Paradox is observed. As having pets is positively associated with being depressed among females, statement (II) is true by the symmetry rule on rates.

30. A Taylor Swift fan website forum conducted a post-concert survey for its members who attended The Eras Tour in 2023. The information collected noted the seating types of respondents (“Seating” or “Standing”), self-evaluations of their concert experience (“Good” or “Bad”), as well as their gender (“Male” or “Female”). The data is represented in the following table.

          Seating           Standing
          Male   Female     Male   Female
Good       30      39        20      33
Bad         6       5         1      14

Which of the following statements is/are true based only on the data above? Select all that apply.
(A) A “Good” concert experience is positively associated with “Standing” seats.
(B) A “Good” concert experience is positively associated with “Seating” seats.
(C) A “Good” concert experience is negatively associated with females.
(D) Gender is a confounder in the association between concert experience and seating type.
(E) Simpson’s Paradox is observed in the association between concert experience and seating type.

(B), (C) and (D) are correct. Checking for the association between concert experience and seating type: rate(Good | Seating) = 69/80 > rate(Good | Standing) = 53/68, therefore (A) is not true and (B) is true. Checking for the association between concert experience and gender: rate(Good | Female) = 72/91 < rate(Good | Male) = 50/57, therefore (C) is true.
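The rates compared so far in this answer can be reproduced directly from the table. The following is a minimal Python sketch for checking the arithmetic; the dictionary layout and the helper name rate_good are illustrative assumptions and not part of the original question.

```python
# Counts from the Question 30 table, stored as (good, bad) per cell.
# The dictionary layout and helper name are illustrative assumptions.
counts = {
    ("Seating", "Male"): (30, 6),
    ("Seating", "Female"): (39, 5),
    ("Standing", "Male"): (20, 1),
    ("Standing", "Female"): (33, 14),
}


def rate_good(seating=None, gender=None):
    """rate(Good | condition), pooling every cell that matches the condition."""
    good = total = 0
    for (s, g), (n_good, n_bad) in counts.items():
        if seating in (None, s) and gender in (None, g):
            good += n_good
            total += n_good + n_bad
    return good / total


good_seating = rate_good(seating="Seating")    # 69/80
good_standing = rate_good(seating="Standing")  # 53/68
good_female = rate_good(gender="Female")       # 72/91
good_male = rate_good(gender="Male")           # 50/57

print(f"rate(Good | Seating)  = {good_seating:.4f}")
print(f"rate(Good | Standing) = {good_standing:.4f}")
print(f"rate(Good | Female)   = {good_female:.4f}")
print(f"rate(Good | Male)     = {good_male:.4f}")

# Direction of each association, matching options (B) and (C).
print("Good positively associated with Seating:", good_seating > good_standing)
print("Good negatively associated with females:", good_female < good_male)
```

Passing both arguments, for example rate_good(seating="Seating", gender="Male"), gives the within-subgroup rates needed for the Simpson’s Paradox check.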
Checking for the association between seating type and gender: rate(Seating | Female) = 44/91 < rate(Seating | Male) = 36/57. Since gender is associated with both concert experience and seating type, (D) is true. Among males, checking the direction of association between concert experience and seating type: rate(Good | Seating) = 30/36 < rate(Good | Standing) = 20/21; while among females, checking the direction of association between concert experience and seating type: rate(Good | Seating) = 39/44 > rate(Good | Standing) = 33/47. The two subgroups do not show the same direction of association, therefore (E) is not true.

1. The results of 10 students in a test are given below, where L indicates that a student obtained a score below 55, and the two H’s (not necessarily the same) indicate scores above 90. The maximum possible score is 100.
60, 80, L, 70, H, 83, 59, H, 70, 65.
Which of the following statements must be true about the scores given above? Select all that apply.
(A) The median score is 70.
(B) There is only one mode among them.
(C) The range is greater than 35.
(D) The mean score is greater than 72.2.

(A) and (C) are true. After reordering the 10 scores, we have:
< 55, 59, 60, 65, 70, 70, 80, 83, > 90, > 90.
The median is the average of the 5th and 6th score, i.e. 70. Here, 70 is one mode. However, we may have two modes, if both scores above 90 are the same. Hence, the statement on mode may not be true. The range is greater than 90 − 55 = 35. Hence, the statement on range is true. Here, 72.2 = (55 + 59 + 60 + 65 + 70 + 70 + 80 + 83 + 90 + 90)/10, the mean that would result if L were 55 and both H’s were 90. However, it is possible that the scores corresponding to L, H and H are, for example, 50, 91, 91 respectively. Then the mean score will be 71.9. Hence the statement on mean may not be true.

2. Which of the statements below will always hold if an outlier for a single numerical variable is removed?
(I) The mean will decrease.
(II) The median will decrease.
(III) The mode will decrease.
(A) Only (I).
(B) Only (I) and (II).
(C) Only (II) and (III).
(D) None of the statements.

Answer is (D). We can give a suitable example to show that none of the statements always holds.
In the set of numbers (−1000, 0, 0, 1), the outlier is −1000. The mean values with and without the outlier are −249.75 and 0.33 respectively. Hence the mean actually increases without the outlier. It is clear that the median and mode of 0 remain unchanged after the removal of the outlier.

3. The graph below compares two sets of grades of a cohort of 233 students who took a course. The histograms of the same two sets of grades are provided below. Based on the figures, which of the following statements is/are correct? Select all that apply.
(A) The boxplot on the left corresponds to the midterm grades.
(B) The median of the final grades is not higher than that of the midterm grades.
(C) The bin width of the histogram for final grades is larger.
(D) There are more outliers in the set of final grades than that of the midterm grades.

(B) and (D) are true. Since the minimum value in the boxplot on the right is below 40, the boxplot on the right must correspond to the midterm grades. Thus the boxplot on the left is for the final grades. We can see from the boxplots that the median of the final grades is slightly lower than that of the midterm grades. The bin widths of both histograms are 10. It is clear that there are outliers in the boxplot for final grades but not in the boxplot for midterm grades.

4. The Registry of Marriages is interested to see the relationship between the ages of husbands and wives in City X. They randomly sampled 1000 pairs of husbands and wives from the population of City X and obtained data of their ages (in years). Looking through the data, they found that men always marry women who are younger than them. Based only on the information given above, which of the following statements must be true?
(I) The average age of the husbands is more than the average age of the wives.
(II) The standard deviation of husband’s age is more than the standard deviation of wife’s age.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (A). Since all the husbands are older than their respective wives, the average age of the husbands would be greater than the average age of the wives, and thus statement (I) is correct. On the other hand, we do not have information about the spread of the husband’s age relative to the spread of the wife’s age, so we cannot be sure if statement (II) is correct.

5. The following boxplot shows the final examination marks of students from class A. The passing mark for the final examination is 12 out of 20. Suppose that the boxplot of the final examination marks for class B is the same as the boxplot for class A. What can be said about the final examination marks of students from class B? Select all the correct statements.
(A) The proportion of students in class B who passed the examination must be the same as that for class A.
(B) At least 50% of the students from class B failed the final examination.
(C) The standard deviation of the students’ marks in class B is equal to the standard deviation of those in class A.
(D) Based on the boxplot, there are no outliers in the grades of students from class B.
(E) The average of the students’ marks for class B must be equal to that for class A.

(B) and (D) are correct. The boxplot above does not tell us anything about the percentage of students who obtained at least 12 out of 20 in the final examination. The median of the final marks is 10, which implies that at least 50% of students failed the examination. The boxplot does not tell us anything about the average and standard deviation of the final examination marks. It is clear from the boxplot that there are no outliers.

6. The five-number summary of a numerical variable with 47 values is:

Min     Q1      Median   Q3      Max
12.0    15.0    16.5     18.0    24.0

Which of the following statements must be true? Select all that apply.
(A) There are no outliers in the data.
(B) There is at least one low outlier in the data.
(C) There is at least one high outlier in the data.
(D) There are both low and high outliers in the data.

Only (C) is true. The IQR is 18 − 15 = 3. Since 24 > 18 + 1.5 × 3 = 22.5, 24 is a high outlier and thus there is at least one high outlier. As 15 − 1.5 × 3 = 10.5 and there are no values smaller than 12, there are no low outliers.

7. Suppose that the following are 10 data points for a numerical variable X:
3, 50, r, 8, 20, 1, 32, 58, 10, 138,
where r is an unknown whole number and r ≠ 138. Based on the definition of an outlier for a boxplot, if 138 is the only outlier in this data set, the maximum possible value of r is ____.

Answer is 133. Firstly, in order to have 138 as the only outlier, we must have r < 138. Then since we want to find the maximum possible value of r, when we order the other 9 points from the smallest to the largest value, we can try to let r be a number between the second largest number and 138. That is, 1, 3, 8, 10, 20, 32, 50, 58, r, 138. We can see that Q1 = 8, Q3 = 58 and IQR = 50. Thus, any number greater than Q3 + 1.5 × IQR = 58 + 1.5 × 50 = 133 will be an outlier. This means that the maximum possible value of r is 133.

8. Alice wants to compare the distribution of car transaction prices in 2020 with that in 2021. In her comparison, she also wants to highlight that the number of transactions in 2021 is much larger than that in 2020. Which of the following best serves her purpose?
(A) Plotting two boxplots (one for each year) of car transaction prices along the same Y-axis.
(B) Plotting two histograms (one for each year) of car transaction prices along the same Y-axis.
(C) Plotting a bar chart of the number of transactions against year.
(D) Plotting a scatter diagram of car transaction prices in 2021 against car transaction prices in 2020.

Answer is (B). Only (A) and (B) incorporate tools to compare distributions of numerical data.
Since a histogram displays information about the size of the data set and a boxplot does not, only (B) can show the difference in the number of transactions between the two years.

9. The CEO of a company wishes to find out the level of job satisfaction that his employees have. His company has 876 employees. He administers an anonymised survey in which there are questions that ask about the employees’ satisfaction with regard to various aspects of their jobs, including welfare, remuneration, career progression, learning, etc. For each question, employees are to rate their satisfaction on a scale of 1-9. In this scale, 1 represents the lowest level of satisfaction whilst 9 represents the highest level of satisfaction. Broadly, any score below 5 for a question is regarded as “not satisfied” and any score above 5 for a question is regarded as “satisfied” while a score of 5 represents being “neutral”. Every employee fills in the survey based on how he/she feels about the job and the data is collected. Assume that every employee is honest in the response and is fully aware of what the values on the scale represent. What type(s) of analysis is/are considered appropriate for the data on the satisfaction scores? Select all that apply.
(A) For each question, one can perform a summary statistics calculation on the satisfaction scores to obtain the mean, standard deviation, median, as well as the interquartile range, thereby giving quantitative information on the employees’ satisfaction as a whole.
(B) For each question, one can plot a histogram to depict the distribution of the satisfaction scores and perform a calculation to determine the exact degree of skewness of the histogram.
(C) For each question, one can calculate the proportion of employees who gave each satisfaction score and determine what percentage of employees are “not satisfied” and what percentage are “satisfied”.

Only (C) is correct.
As satisfaction scores can be subjective and differences in satisfaction levels need not be the same, this variable cannot be treated as a numerical variable and is an ordinal variable. Hence, while we can compute summary statistics and perform arithmetic, after doing so, we are unable to establish any numerical meaning from these statistics. A histogram is used to plot the distribution of a numerical variable, so once again it is not appropriate to use a histogram to plot this data and we should not perform any calculation to determine its skewness. For each question, it is appropriate to calculate the proportion of responses for each of the satisfaction scores and then determine the overall proportion of employees who are “satisfied” and “not satisfied”, which is generally how we should deal with categorical variables.

10. The following histogram and boxplot are constructed using 90 observations of the numerical variable x. Based on the graphics, determine which of the following statements must be true.
(A) The minimum value is 2.73.
(B) The distribution is right-skewed.
(C) Only a quarter of the observations is higher than 4.32.
(D) None of the other statements must be true.

Answer is (D). The minimum value corresponds to the outlier shown in the boxplot, which is lower than 2.73. The distribution is left-skewed. 4.32 is the median, hence (C) is also not true.

11. Suppose that the following are 10 data points for a numerical variable X:
−150, 50, r, 8, 20, 1, 32, 70, 10, 5,
where r is an unknown number and r ≠ −150. Based on the definition of an outlier for a boxplot, if −150 is the only outlier in this data set, the minimum possible value of r is
(A) −149.9.
(B) −45.5.
(C) −30.
(D) −14.5.

Answer is (B). Firstly, in order to have −150 as the only outlier, we must have r > −150.
Then since we want to find the minimum possible value of r, when we order the other 9 points from the smallest to the largest value, we can try to let r be a number between −150 and the second smallest number. That is, −150, r, 1, 5, 8, 10, 20, 32, 50, 70. We can see that Q1 = 1, Q3 = 32 and IQR = 31. Thus, any number smaller than Q1 − 1.5 × IQR = 1 − 1.5 × 31 = −45.5 will be an outlier. This means that the minimum possible value of r is −45.5.

12. The graph below shows a scatter plot between exam scores for 100 students and their corresponding number of days absent from school throughout a year. The maximum mark for the exam is 100. There is an outlier present in the scatter plot. From the scatter plot above, which of the following statements is/are true? Select all that apply.
(A) After removing the outlier, the strength of the linear association between the two variables will increase.
(B) After removing the outlier, the correlation coefficient between the two variables will increase.
(C) Higher absenteeism rate causes lower exam scores.
(D) If a linear regression line is fitted to this data, the gradient of the line will be negative.

(A) and (D) are true. The correlation between the two variables is negative. After removing the outlier, the strength of the linear association between the two variables will increase, hence (A) is true. The correlation coefficient after removing the outlier will decrease as it will go closer to −1, hence (B) is false. (C) is false as it is a statement about causation, which cannot be concluded from correlation alone. (D) is true as the sign of the correlation coefficient can tell us the sign of the gradient of the regression line. Since the correlation coefficient is negative, as seen in the scatter plot, the gradient of the regression line will also be negative.

13. Mary wanted to find out if the duration spent on playing video games is associated with the amount of sleep one gets.
She collected data from 5000 participants who played video games for 6 to 10 hours each day. She found that the participants’ number of sleep hours varied from 6 to 8 hours each day, and calculated the formula of the linear regression line to be y = 11 − 0.5x, where x represents the number of hours spent playing video games each day and y represents the number of sleep hours each day. From the data, what is a reliable prediction of the average number of sleep hours for a person who plays an hour of video games each day?
(A) 10.
(B) 9.
(C) 10.5.
(D) 20.
(E) There is insufficient data to make a reliable prediction.

Answer is (E). The linear regression line formula would not provide us a reliable prediction of the average number of sleep hours for hours spent playing video games beyond the observed range of 6 to 10 hours. The best fit line may change beyond the observed range or there could be a non-linear relationship between the two variables.

14. The following scatter plot has an outlier, which has both positive X and Y values. X ranges from 0 to 100 and Y ranges from 0 to 30. What will happen to the correlation coefficient between X and Y if we remove the outlier?
(A) It will remain the same.
(B) It will decrease.
(C) It will increase.

Answer is (B). Initially, the correlation coefficient is positive. When we remove the outlier, the data points form an “ellipse” and thus the new correlation coefficient between X and Y is close to 0, meaning that the correlation coefficient decreased.

15. There are at least 3 known scales that are used to measure temperature: namely, the Kelvin scale (denoted by K), the Fahrenheit scale (denoted by °F), and the Celsius scale (denoted by °C). The relationships between the 3 scales are as follows:
Temperature in Fahrenheit = 9/5 × Temperature in Celsius + 32.
Temperature in Kelvin = Temperature in Celsius + 273.15.
Within a data set, it is known that the correlation coefficient between numerical variables x and T is 0.72 where T is the temperature measured in degrees Fahrenheit. The equation of the regression line is T = 0.32x + 87 for x ranging from 0 to 5 inclusive. Based on this information, which of the following statements is/are true? Select all that apply.
(A) If we change the scale of temperature to Kelvin, the correlation coefficient and gradient of the regression line will remain unchanged.
(B) If we change the scale of temperature to degrees Celsius, the correlation coefficient and gradient of the regression line will remain unchanged.
(C) If we change the scale of temperature to Kelvin, only the correlation coefficient remains the same but the gradient of the regression line changes.
(D) If we change the scale of temperature to degrees Celsius, only the correlation coefficient remains the same but the gradient of the regression line changes.
(E) If we change the scale of the temperature to Kelvin, both the correlation coefficient and the gradient of the regression line will change.

(C) and (D) are true. The correlation coefficient is unchanged by adding and subtracting constants as well as multiplying and dividing by positive constants. Recall the relationship m = r × sy/sx. Changing from Fahrenheit to Celsius involves multiplying by 5/9, which will cause the standard deviation of temperature to change. But since r remains unchanged, the gradient therefore has to change. But a change from Celsius to Kelvin does not cause any further change to the gradient since we are only adding a constant. Therefore, a change from Fahrenheit to Kelvin will result in a change to the gradient.

16. Which of the following statements is/are true?
(I) A correlation coefficient of 0 means that there is no linear association between the two variables.
(II) A correlation coefficient of −0.8 indicates a weaker linear association than a correlation coefficient of 0.7.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (A). Statement (I) is true. The correlation coefficient is a measure of linear association between two variables, but it is unable to measure non-linear association. Statement (II) is false as the correlation coefficient of −0.8 shows a stronger linear association than 0.7. This is because the magnitude of the first correlation coefficient, 0.8, is greater than the second one, 0.7. The negative sign only indicates the direction of the correlation, but not the strength.

17. Xiao Ming finds that the heavier a person is, the higher his pulse rate tends to be. A linear regression model he built suggests that 20-kilogram differences in weight correspond to differences in pulse rates of 5 beats per minute. Which of the following statements is/are always correct?
(I) The correlation coefficient between body weight and pulse rate is 1/4.
(II) Your pulse rate slows down 3 beats per minute if your weight decreases by 12 kilograms.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (C). Statement (I) is false as the gradient of the regression line (1/4 in this case) need not be the same as the correlation coefficient between the variables. Statement (II) is false as the regression line gradient gives only the “predicted average value of y” given a value of x.

18. A researcher predicts the total number of bacteria in an experiment (denoted by y) using simple linear regression on ln y vs x. The regression equation is given by ln y = 0.5x + 2.5, where the logarithm here is taken to base e, and x is the number of hours since 12PM. Which of the following statements is/are always correct?
(I) According to the researcher’s model, there will be an average of 33 (rounded to the nearest integer) bacteria in 2 hours.
(II) The average number of bacteria is predicted to increase by the same amount every hour.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (A). When x = 2, ln y = 3.5. Thus, we obtain y = e^3.5 ≈ 33. Statement (II) is not correct. We can verify this explicitly. For instance, from x = 0 to x = 1, the predicted average number of bacteria increases by 8. From x = 1 to x = 2, the predicted average number increases by 13.

19. A group of students wanted to investigate the relationship between bubble tea and ice cream consumption. They conducted a survey to find out the number of cups of bubble tea and scoops of ice cream consumed in a year. From the data collected, they drew a scatter plot and fitted a linear regression line to the data with the equation of y = 0.73x. Based on the information given in the scatter plot, which statement(s) must be true?
(I) The correlation between bubble tea and ice cream consumption is 0.73.
(II) People who consume 100 cups of bubble tea are predicted to consume 73 scoops of ice cream on average.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (D). The regression line equation tells us about the slope and intercept of the best-fit line. In this case, the slope is 0.73 and the intercept is 0. The slope may not be equal to the correlation coefficient in general so statement (I) may not be true. There is no information about people who consume more than 70 cups of bubble tea from the current data. Hence we cannot use the regression line to predict the average number of scoops of ice cream consumed for someone who consumed 100 cups of bubble tea. Hence statement (II) is not true.

20. There are two primary six classes in a tuition center. Class A and Class B each have 100 students and all students sat for a mathematics midterm test as well as a final examination. In Class A, every student scores 1 point higher in the final examination than in the midterm. In Class B, every student scores 1 point lower in the final examination than in the midterm.
For the midterm test, the average score is 50 and standard deviation is 20 for both classes A and B. Which of the following statements is/are correct?
(I) The correlation coefficient between the final examination score and the midterm score in Class A is 1.
(II) The correlation coefficient between the final examination score and the midterm score in Class B is −1.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (A). Note that the relationship between the midterm score and final examination score is deterministic. Furthermore, since standard deviation is 20, not all students scored the same midterm/final examination mark. For Class A, imagine the points (49, 50), (50, 51), (52, 53) and so on, each (x, y) representing the midterm score x and final exam score y. Clearly, these points lie on the straight line y = x + 1 with positive gradient. So the correlation coefficient between final examination score and the midterm score in Class A is 1. Similarly, for Class B, imagine the points (49, 48), (50, 49), (52, 51) and so on. These points also lie on a straight line y = x − 1 with positive gradient. The correlation coefficient between final examination score and the midterm score in Class B is also 1.

21. A student produces a scatter plot whereby the y values range from 1 to 3, and he observes a linear association, with regression equation y = 1.5x + 5. Which of the following statements is/are correct?
(I) The correlation coefficient must be positive.
(II) The average y value of the data set collected is 2.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (A). Since the regression line has a positive slope, the correlation coefficient must be positive and thus statement (I) is correct. While we know that the y values range from 1 to 3, it does not mean that the average value of y is 2, so statement (II) is not correct.

22.
A set of bivariate data has the following properties: n = 20, $\bar{x} = 4$, $\bar{y} = 5$, $s_x = 1$, $s_y = 2$, r = 0.5, 0 ≤ x ≤ 10. For x = 5, the predicted average value of y is ____.

Answer is 6. Let the regression equation be y = mx + b. Note that $m = r \frac{s_y}{s_x} = 0.5 \times \frac{2}{1} = 1$, and thus $b = \bar{y} - m\bar{x} = 5 - 4 = 1$. The regression line is thus y = x + 1. When x = 5, the predicted average value of y is 6.

23. Dash is interested to find out if there is a correlation between the time spent in playing computer games (in hours) and student’s score (out of 100). After slicing the data according to gender, it was found that the regression line for males is score = −1.5 × hours spent gaming + 85 and the regression line for females is score = −1.3 × hours spent gaming + 87.5. Furthermore, Simpson’s Paradox is observed when the 2 subgroups are combined. Which of the following is/are possible ranges of r, the correlation coefficient between the time spent gaming and the students’ score when the subgroups are combined?
(A) 0 ≤ r ≤ 1.
(B) −1 ≤ r < 0.
(C) There is not enough information to guess the range of r.

Answer is (A). Since the coefficient of hours spent gaming is negative for both subgroups, the correlation coefficient between hours spent gaming and score must be negative within each subgroup. Additionally, we also know that Simpson’s Paradox is observed when the 2 subgroups are combined. Hence, in the combined data, the association between hours spent gaming and score must be nonnegative, which means that 0 ≤ r ≤ 1.

24. There are 50 students in a class and the average (or mean) number of pens per student is 4. Not all students in this class have the same number of pens. For a student with 4 pens, what is the standard unit for the number of pens for this student?
(A) −1.
(B) 0.
(C) 1.
(D) The standard deviation is required to answer this question.

Answer is (B). Note that standard deviation is non-zero here, since not all students in this class have the same number of pens; there is a spread around the mean 4.
For a student with x = 4 pens, since $\bar{x} = 4$, the standard unit for the number of pens for this student is $\frac{x - \bar{x}}{s_x} = 0$.

25. Which of the following gives the most likely correct values of the correlation coefficients obtained from the four scatter plots? Here, r1 refers to the correlation coefficient of plot (1), r2 refers to the correlation coefficient of plot (2) and so on.
(A) r1 = 0.8, r2 = 0.92, r3 = 0.45, r4 = −0.7.
(B) r1 = 0.06, r2 = 0.92, r3 = 0.9, r4 = −0.7.
(C) r1 = 0.06, r2 = 0.92, r3 = 0.45, r4 = −0.7.
(D) r1 = 0.06, r2 = 0.92, r3 = 0.45, r4 = 0.05.

Answer is (C). For plot (1), there is a strong non-linear association, but the linear correlation is not strong. For plot (2), there is a relatively strong linear correlation. Plot (3) shows a weak linear correlation and plot (4) shows a negative correlation. Considering all the above facts, (C) is the best option.

26. A researcher wishes to examine the association between the number of hours students engage in gaming and their academic performance. He conducts his study in 15 schools, and calculates for each school the average number of hours students spend gaming and the average academic score of students. He notices that there is a correlation of −0.9 between the two sets of averages. He concludes that “The correlation between gaming hours and academic score for all students from the 15 schools is −0.9”. What fallacy is the researcher committing here?
(A) Atomistic fallacy.
(B) Ecological fallacy.

Answer is (B). The researcher is trying to obtain the correlation at the individual level solely from the correlation at the aggregate level. This is an example of ecological fallacy.

27. In a study on the impact of exercise on weight loss, researchers gather data on the exercise habits and weight changes of all students in a particular school over a six-month period.
After analysing the data, they find that for individual participants, there is no correlation between the amount of exercise they engage in and the amount of weight they lose. Some participants lose weight without exercising much, while others exercise regularly but don’t experience weight loss. Based on these individual-level findings, the researchers conclude that if they were to compute the average exercise duration and average weight loss for each class in the school, there will also be no correlation between these aggregated averages. Which one of the following mistakes has been committed?
(A) Atomistic Fallacy.
(B) Ecological Fallacy.
(C) Confusing correlation with causation.
(D) None of the other options is correct.

Answer is (A). This conclusion is an Atomistic Fallacy because it incorrectly generalizes the lack of a correlation at the individual level to the aggregated level. The study fails to account for potential variations among individuals and the possibility that exercise may have different effects on weight loss for different people. It is possible that while at the individual level we may not observe any correlation between exercise and weight loss, at the aggregated level, exercise could still be correlated with weight management.

28. Which of the following is an example of Simpson’s Paradox?
(A) The gradient of the regression line for y given x for all individuals in a data set is −0.2. When partitioned into 3 different subgroups, the gradients of the regression line for the same two variables are 2, 3.4 and −5.
(B) The overall correlation between x and y in a data set is 0.01. When partitioned into 3 different subgroups, the correlations between the same two variables are 0.21, 0.13 and −0.7.
(C) The intercept of the regression line for y given x for all individuals is 1. When partitioned into 4 different subgroups, the intercepts of the regression line for the same two variables are −2, −0.2, −1 and −0.3.
(D) The overall correlation between x and y in a data set is 0.2. When partitioned into 4 different subgroups, the correlations between the same two variables are −0.43, −0.36, 0.51 and 0.91.

Answer is (A). The intercept does not show us the direction of the association, so in option (C) all 4 subgroups could have the same direction of association as the overall. Correlation and gradient both show us direction, but only option (A) has more than half of the subgroups having a different direction of association than the overall.

29. Suppose that the relationship between two variables can be best modelled by a power function of the form $y = ax^b$, where a and b are positive constants and x > 0. Which set of variables should we plot if we wish to observe a linear relationship?
(A) y versus ln(x).
(B) ln(y) versus x.
(C) ln(y) versus ln(x).

Answer is (C). $y = ax^b$ can be transformed into ln(y) = ln(a) + b ln(x), which is in the form of a linear equation Y = mX + c, with Y = ln(y) and X = ln(x).

30. Suppose you are a student researcher interested in studying the relationship between the average number of hours students spend studying in a day (independent variable) and their final exam scores (dependent variable). You collect data from a sample of 50 students, recording the average number of hours each student studied a day and their corresponding final exam scores. From the data collected you gathered the following information:
3 hours ≤ Number of Hours Spent Studying ≤ 10 hours
2 marks ≤ Final Exam Scores ≤ 35 marks
Equation of the regression line: y = 4x − 10
What will be the predicted marks of a student if he studies for 11 hours on average in a day?
(A) 4 marks
(B) 9 marks
(C) 34 marks
(D) We are not able to give a good prediction with the given information.

Answer is (D). We are not able to give a good prediction as the x-variable has a maximum value of 10 hours, while the student studied for 11 hours on average.

1. A player rolls a fair six-sided die twice. You can assume the rolls are independent.
We define the following events:
A: The first roll shows numbers 1 or 2.
B: The second roll shows numbers 5 or 6.
C: The sum of the two rolls is less than or equal to 7.
Consider the following statements:
(I) P(B | C) > P(B).
(II) P(A and C) = P(A) × P(C).
Which of the statements above must be true?
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (C). Statement (I) is false since $P(B \mid C) = \frac{P(B \text{ and } C)}{P(C)} = \frac{3/36}{21/36} = \frac{1}{7}$, while $P(B) = \frac{1}{3}$. Statement (II) is false because $P(A \text{ and } C) = \frac{11}{36}$ while $P(A) = \frac{1}{3}$ and $P(C) = \frac{21}{36} = \frac{7}{12}$, so that $P(A) \times P(C) = \frac{7}{36}$.

2. The sample space S of a probability experiment comprises the 30 days of September 2021. Let A denote the first 10 days of September 2021. Let B denote 1st September 2021. Which of the following statements is/are correct? Select all that apply.
(A) A is an event of S and B is an outcome of S.
(B) A is an outcome of S and B is an event of S.
(C) Both A and B are events of S.
(D) Both A and B are outcomes of S.

(A) and (C) are correct. A is a sub-collection of S so it is an event of S. B is an element of S so it is an outcome of S. Since we also regard outcomes as events, both A and B are events of S.

3. A study was conducted on 1000 students in a secondary school on 11th January 2021. On that day, of these 1000 students, 80% of the students took the MRT (mass rapid transit) to school. Among those students that took the MRT to school, 10% were late for school. Among those students who did not take the MRT to school, 20% were late for school. Among those students who participated in the study and were late for school, a student was randomly chosen, with every student having the same chance of being selected. What is the probability (correct to 1 decimal place) that the chosen student did not take the MRT?
(A) 20.0
(B) 10.0
(C) 33.3
(D) 30.0

Answer is (C).
The total number of students is 1000 and since P(Take MRT) = 0.8 and P(Did not take MRT) = 0.2, 800 students took MRT and 200 students did not. Among 800 students who took MRT to school, 10% were late. That is, P(Late | Take MRT) = 0.1, so there were 80 students who took the MRT and were late. Among 200 students who did not take MRT to school, 20% were late. So P(Late | Did not take MRT) = 0.2, so there were 40 students who did not take the MRT and were late. Put the calculated figures in a 2 × 2 table as shown below.

                    Late    Not late    Row total
Took MRT              80         720          800
Did not take MRT      40         160          200
Column total         120         880         1000

From the table, we can calculate $P(\text{Did not take MRT} \mid \text{Late}) = \frac{40}{120} = 33.3\%$ (1 decimal place).

4. A football match is played between Singapore and Italy, at the Singapore National Stadium. 90% of the spectators in the stadium are wearing red. Among the spectators in the stadium wearing red, 80% are rooting for Singapore. Among the spectators in the stadium not wearing red, 80% are rooting for Singapore. Among all the spectators in the stadium, a spectator is randomly chosen with every spectator having the same chance of being chosen. Let A be the event that the chosen spectator is rooting for Singapore, and B be the event that the chosen spectator is wearing red. Which of the following statements, regarding the 2 events A and B is true?
(A) The 2 events are mutually exclusive and independent.
(B) The 2 events are mutually exclusive but not independent.
(C) The 2 events are independent but not mutually exclusive.
(D) The 2 events are neither independent nor mutually exclusive.

Answer is (C). By the basic rule on rates, observe that P(Root for Singapore | Wear red) = P(Root for Singapore | Does not wear red) = P(Root for Singapore) = 0.8. Since P(Root for Singapore) = P(Root for Singapore | Wear red) = 0.8, rooting for Singapore and wearing red are independent events.
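As a quick check of the independence argument above, the following Python sketch recomputes P(Root for Singapore) from the stated rates using exact fractions, then compares P(A and B) with P(A) × P(B). The function name `check_independence` and its parameter names are illustrative choices and are not part of the original question.

```python
from fractions import Fraction


def check_independence(p_b, p_a_given_b, p_a_given_not_b):
    """Return (P(A), P(A and B), independent?) from P(B) and two conditional rates.

    Hypothetical helper for illustration only.
    """
    # Law of total probability: P(A) = P(B)P(A|B) + P(not B)P(A|not B)
    p_a = p_b * p_a_given_b + (1 - p_b) * p_a_given_not_b
    # Multiplication rule: P(A and B) = P(B)P(A|B)
    p_a_and_b = p_b * p_a_given_b
    # A and B are independent exactly when P(A and B) = P(A)P(B)
    return p_a, p_a_and_b, p_a_and_b == p_a * p_b


# Figures from this question: 90% wear red, and 80% root for Singapore in both groups.
p_a, p_a_and_b, independent = check_independence(
    Fraction(9, 10), Fraction(8, 10), Fraction(8, 10)
)
print(p_a, p_a_and_b, independent)
```

With the figures from this question, the sketch gives P(A) = 4/5 and P(A and B) = 18/25 = P(A) × P(B), in line with the conclusion that the two events are independent.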
Here P(Root for Singapore) + P(Wear red) = 1.7 ≠ P(Root for Singapore or Wear red), since P(Root for Singapore or Wear red) ≤ 1. Hence rooting for Singapore and wearing red are not mutually exclusive events.

5. Systematic sampling with interval length k = 20 is to be utilised for an exit poll at an event. Suppose a total of 410 people exit the event. What is the probability that the 57th person to exit the event is selected for the poll?
(A) $\frac{1}{317}$.
(B) $\frac{1}{20}$.
(C) $\frac{1}{410}$.
(D) Cannot be determined with the information provided.

Answer is (B). The 57th person will be selected if and only if number 17 is chosen as the starting point, out of the integers 1 to 20. But since the selection of the starting point is random (meaning every number has the same chance of being selected), each integer from 1 to 20 has a probability of $\frac{1}{20}$ to be selected, so the probability of the 57th person being selected is also $\frac{1}{20}$.

6. A new machine is invented to detect COVID-19. A clinical study is done with a random sample of size 10000. From the study, both the sensitivity and specificity of the machine are calculated to be 90%. Which of the following statements must be correct?
(I) Among those with COVID-19 in the sample, 10% tested negative.
(II) Among those who tested negative in the sample, 10% have COVID-19.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (A). Statement (I) is correct. Sensitivity = 90% means that among those with COVID-19, 90% are tested positive. Thus among those with COVID-19, 10% are tested negative. Statement (II) may not be correct. Specificity of the test is P(Tested negative | Without COVID-19) = 90%. We are unable to deduce further with the limited information, if exactly 10% of those who tested negative in the sample, have COVID-19. It depends on the base rate P(COVID-19).

7. There are 3 families X, Y and Z. The families have 2 children each. Family X has 1 boy and 1 girl.
Family Y has only 2 girls and Family Z has only 2 boys. 1 child is randomly selected among the 6 children. If the selected child is a boy, what is the probability that he is from family X?
(A) 0.
(B) $\frac{1}{6}$.
(C) $\frac{1}{3}$.
(D) $\frac{1}{2}$.

Answer is (C). The required probability is the conditional probability $P(\text{child from family X} \mid \text{child is a boy}) = \frac{P(\text{child is a boy from family X})}{P(\text{child is a boy})} = \frac{1/6}{3/6} = \frac{1}{3}$.

8. Every resident in a town undergoes a new COVID test. The sensitivity and specificity of the test are 90% and 99% respectively. Among those who tested positive, 60% are actually infected with the virus. Which of the following is closest to the base rate for COVID infection in this town?
(A) 60%.
(B) 16%.
(C) 1.6%.
(D) 0.16%.

Answer is (C). Suppose there are 1000 residents tested positive. By the information provided, 600 are infected with the COVID virus and 400 are not.

                 Infected    Not infected    Row total
Positive test         600             400         1000
Negative test           ?               ?            ?
Column total            ?               ?            ?

As the sensitivity of the test is 90%, this means that $P(\text{Positive test} \mid \text{Infected}) = 0.9 = \frac{600}{x}$, where x is the total number of residents infected with COVID. This gives us x = 667 (rounded to nearest integer). As the specificity of the test is 99%, this means that $P(\text{Negative test} \mid \text{Not infected}) = 0.99 = \frac{y - 400}{y}$, where y is the total number of residents not infected with COVID. This gives us y = 40000. Since the total number of residents is 667 + 40000 = 40667, the base rate for COVID infection in the town is $\frac{667}{40667} \times 100\%$ which is approximately 1.6%.

9. For two events A and B, P(A) = 0.5 and P(B) = 0.2. Which of the following statements is/are always true? Select all that apply.
(A) P(A or B) = 0.7 if A and B are independent events.
(B) P(A or B) = 0.7 if A and B are mutually exclusive events.
(C) P(A and B) = 0.1 if A and B are independent events.
(D) P(A and B) = 0.1 if A and B are mutually exclusive events.

(B) and (C) are correct.
For mutually exclusive events, P(A and B) = 0 and P(A or B) = P(A) + P(B) = 0.7. For independent events, P(A and B) = P(A) × P(B) = 0.1 and P(A or B) = P(A) + P(B) − P(A and B) = 0.6 (this can also be derived via a Venn diagram).

10. The probability of flipping a biased coin and getting two tails in a row is 0.16. Each flip is independent of the other. The probability of flipping that same coin and getting two heads in a row is ____. Fill in the blank above. Give your answer correct to 2 decimal places.

The answer is 0.36. Each flip is independent of the other, which means that the probability of getting one tail is $\sqrt{0.16} = 0.4$. The probability of getting one head is then 1 − 0.4 = 0.6. The probability of getting two heads in a row will be 0.6 × 0.6 = 0.36.

11. An airline is conducting a lucky draw event on 10 July 2023. It will first make a list of all its 50 flights on that day and then randomly select one flight from the list. After that, 10 passengers will be randomly chosen from the selected flight to receive lucky draw prizes. It is confirmed that there will be 120 passengers on each of the 50 flights that day. Suppose that Amy will be on exactly one of the flights on 10 July 2023. The probability of Amy receiving a lucky draw prize from the airline is ____. (Give your answer in percentage correct to 2 decimal places.)

Answer is 0.17%. P(Amy receiving a lucky draw prize) = P(Amy is on the selected flight AND she is one of the 10 chosen passengers on the flight) = 1/50 × 10/120 = 0.17% (correct to 2 decimal places).

12. Camera lenses are made by two companies, A and B. Suppose that 60% of all lenses made are by Company A, while the rest are made by Company B. 7% of all lenses made by Company A are faulty, while 9% of all lenses made by Company B are faulty. One of the lenses was selected at random and observed to be faulty. What is the probability that it was produced by Company A? Give your answer as a number between 0 and 1, to the nearest 2 decimal places.
Answer is 0.54. P(A | faulty) = P(A and faulty)/P(faulty). While P(A and faulty) = P(faulty | A) × P(A) = 0.07 × 0.6 = 0.042, we use the Law of Total Probability to compute P(faulty). That is, P(faulty) = P(faulty | A) × P(A) + P(faulty | B) × P(B) = 0.07 × 0.6 + 0.09 × 0.4 = 0.078. Therefore, P(A | faulty) = P(A and faulty)/P(faulty) = 0.042/0.078 = 0.54.

13. A poll conducted before a local election gives a 95% confidence interval for the percentage of voters who support candidate X as (54%, 60%). Based on the same poll result, which of the following can potentially be a 99% confidence interval for the percentage of voters who support candidate X?
(A) (56%, 58%).
(B) (52%, 62%).
(C) (54%, 60%).

Answer is (B). A 99% confidence interval should be wider than the 95% confidence interval for the same sample.

14. Based on a random sample of 200 staff members in NUS, the 95% confidence interval for the proportion of all NUS staff who went on vacation for at least 5 days in 2018 is (0.33, 0.59). Which of the following statements must be true?
(I) If another sample of size 500 is drawn using the same sampling method, for the same confidence level, the confidence interval will be wider than (0.33, 0.59).
(II) A maximum of 59% of all NUS staff went on vacation for at least 5 days in 2018.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (C). Statement (I) is false since a bigger sample size will likely result in a narrower confidence interval for the same confidence level. Statement (II) need not be true as the interval (0.33, 0.59) is not guaranteed to contain the population proportion.

15. A researcher takes a random sample from Country X’s population to estimate its unemployment rate. From the sample, the researcher obtains the 95% confidence interval for the population unemployment rate to be between 0.18 and 0.22. Which of the following statements correctly interprets the results?
(A) If many samples of the same size were collected using the same procedure, and their respective confidence intervals calculated in the same way, about 95% of these samples will have the sample unemployment rate lie between 0.18 and 0.22.
(B) If many samples of the same size were collected using the same procedure, and their respective confidence intervals calculated in the same way, about 95% of these samples will have the population unemployment rate lie between 0.18 and 0.22.
(C) If many samples of the same size were collected using the same procedure, and their respective confidence intervals calculated in the same way, about 95% of these samples will have the sample unemployment rate lie within the samples’ respective confidence intervals.
(D) If many samples of the same size were collected using the same procedure, and their respective confidence intervals calculated in the same way, about 95% of these samples will have the population unemployment rate lie within the samples’ respective confidence intervals.

Answer is (D). By the interpretation of confidence intervals via repeated sampling, we conclude that about 95% of the samples will contain the population parameter (in this case the population unemployment rate) within their respective confidence intervals.

16. A nutritionist wants to estimate the average mass of a particular tortilla chip brand, Naritos. He decides to collect a sample of 100 packets of Naritos chips, using the method described as follows: out of all the factories that produce chips, he goes down to the nearest factory that produces Naritos chips on a particular day and measures the average mass of the first 100 packets of Naritos chips produced on that day. You can assume the mass of the packaging is negligible. Using the formula $\bar{x} \pm t^* \times \frac{s}{\sqrt{n}}$, he obtains a 95% confidence interval of [676.3, 723.7] for the population mean. Which of the following statements must be true?
(I) If we collect 100 samples of 100 packets of Naritos chips using the same sampling method, about 95% of the samples will result in confidence intervals containing the population average mass of Naritos chips.
(II) We can conclude that there is a 95% chance that the population average mass of Naritos chips lies within [676.3, 723.7].
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

Answer is (C). Recall that Sample statistic = population parameter + bias + random error. For a confidence interval to be valid, we need the assumption that bias is approximately zero, so that Sample statistic = population parameter + random error. However, in this scenario, a probability sampling method is not used to acquire the sample, so selection bias will interfere with the sample estimate of the population mean mass of Naritos chips. It is possible that many samples out of the 100 samples collected have a sample mean that overstates the true population mean by a large amount. For instance, it is possible that the first 100 packets of chips produced each day are a lot heavier than the average packet of chips. Thus we do not expect to see 95 out of 100 of the confidence intervals generated containing the true population mean. Hence, statement (I) is false. Furthermore, we do not use the word “chance” that the population mean lies within a given range. Thus statement (II) is also false.

17. A random sample of size 500 is taken from a population of 10000 people of age 50. From the sample, a 95% confidence interval for the population mean weight is constructed. Which of the following statements is/are correct?
(I) The confidence interval will always contain the sample mean weight.
(II) If many samples of the same size are collected using the same sampling method, about 5% of the confidence intervals from these samples will not contain the population mean weight.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (C).
A confidence interval constructed from a sample will always contain the sample mean (while we cannot be sure if it will contain the population mean). For statement (II), the interpretation of confidence intervals via repeated sampling tells us that about 95% of the confidence intervals from these samples will contain the population mean, so about 5% of the intervals will not contain the population mean.

18. A 99% confidence interval for the mean height (in meters) of NUS students is [1.58, 1.80]. It is constructed using a random sample of 100 students. Using the same sample, which of the following is a plausible 95% confidence interval for the mean height?
(A) [1.64, 1.86].
(B) [1.61, 1.77].
(C) [1.48, 1.89].
(D) [1.63, 1.85].

Answer is (B). From the 99% confidence interval [1.58, 1.80], we can infer that the sample mean is $\frac{1.80 + 1.58}{2} = 1.69$ and the margin of error is $\frac{1.80 - 1.58}{2} = 0.11$. As we use the same sample to calculate the 95% confidence interval, the sample mean remains the same but the margin of error decreases as the confidence level is lower. Therefore, the lower bound of the 95% confidence interval should be larger than the lower bound of the 99% confidence interval. Similarly, the upper bound of the 95% confidence interval should be smaller than the upper bound of the 99% confidence interval. By checking the options given, [1.61, 1.77] is the only plausible answer.

19. A 95% confidence interval for the population mean income is constructed from a random sample. Which of the following statements is/are true? Select all that apply.
(A) A smaller sample size is likely to result in a narrower 95% confidence interval.
(B) For the same sample, a 90% confidence interval will be narrower.
(C) There is a 95% chance that the population mean income will fall within the constructed confidence interval.
(D) The confidence interval constructed contains the sample mean.

(B) and (D) are correct.
(A) is not correct since a smaller sample size is likely to result in a wider (not narrower) 95% confidence interval. (B) is correct: since a 90% confidence level is lower than a 95% confidence level, the confidence interval constructed will be narrower than the one constructed at 95% confidence level. (C) is not correct because we do not use the word ‘chance’ that the population mean lies within a given range. It is either within the range or not. (D) is correct since a confidence interval constructed from a sample will always contain the sample mean.

20. A researcher working for the government wishes to estimate the average daily expense for food, for singles staying alone in HDB flats in a particular neighbourhood. He manages to obtain a list of all singles living in the neighbourhood and uses simple random sampling to pick 100 singles. The sample estimate obtained is $30.20 and the standard deviation is $5.62. Calculate the margin of error, correct to 2 decimal places, for a 95% confidence interval of the population parameter. The corresponding t* value for the 95% confidence interval is 1.984.

Based on the formula for the margin of error, $t^* \times \frac{s}{\sqrt{n}} = 1.984 \times \frac{5.62}{\sqrt{100}} \approx 1.115$, the answer correct to 2 decimal places is 1.12.

21. Joseph wishes to know how everyone in his class fares for GEA1000 quiz 1. He sent a survey to everyone in the class asking for their quiz 1 scores and everyone responded truthfully. He then constructed a 95% confidence interval for the mean score with the data collected and found that the interval is [4.4, 6.8]. Which of the following statements must be true?
(I) The constructed confidence interval may not contain the population mean score.
(II) The population mean score can be found given the constructed confidence interval.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

Answer is (B). Note that Joseph reached out to the population of interest with 100% response rate.
With a census, there is in fact no need to construct a confidence interval since the computed mean would be the population mean and not a sample mean. Nevertheless, if Joseph went ahead to construct a 95% confidence interval, the confidence interval would contain the population mean and (I) is false. The population mean is (4.4 + 6.8)/2 = 5.6 and therefore (II) is true.

22. Which of the following statements about the p-value is/are true? Select all that apply.
(A) The p-value is the probability of obtaining results at least as favourable to the alternative hypothesis as the collected data, computed based on the assumption that the null hypothesis is true.
(B) The p-value gives the probability that the null hypothesis is true.
(C) At a 5% significance level, a p-value smaller than 0.05 provides sufficient evidence that the null hypothesis should be rejected.
(D) At a 5% significance level, a p-value larger than 0.05 provides sufficient evidence that the null hypothesis should be rejected.

(A) and (C) are correct. In a hypothesis test, the null hypothesis is assumed to be true; we neither prove nor provide evidence that the null hypothesis is true. A significance level is a threshold we set, so that any p-value below it indicates there is sufficient evidence to reject the null hypothesis.

23. Suppose we want to test if a coin is biased towards heads. We decide to toss the coin 10 times and record the number of heads. We shall assume the independence of coin tosses, so that the 10 tosses constitute a probability experiment. Let X denote the number of heads occurring in 10 tosses of the coin. We will carry out a hypothesis test with X as the test statistic. Let H be the event that the coin lands on heads, in a single toss. We set our hypotheses to be
H0: P(H) = 0.5,
H1: P(H) > 0.5.
Suppose in our execution of the 10 tosses, we observe 4 heads. This means X = 4 is the test result we observe.
Recall the definition of p-value to be the probability of obtaining a test result at least as extreme as the one observed, assuming the null hypothesis is true. What is the range of test results “at least as extreme as the one observed”, in this scenario?
(A) 0 ≤ X ≤ 4.
(B) 4 ≤ X ≤ 10.
(C) 0 ≤ X ≤ 5.
(D) 5 ≤ X ≤ 10.

Answer is (B). Note that in the context of p-value computation, “at least as extreme” is interpreted as “at least as favourable to the alternative hypothesis”. In this scenario, the greater the value X assumes, the more favourable the case is to the alternative hypothesis. Hence, the answer should be all values of X greater than or equal to the observed value.

24. A group of students wants to find out if there is any association between staying in a hall and being late for class in NUS in a particular month. If students are late for at least 5 classes, they are considered “late for class” in that month. After collecting a random sample of 1000 students, they found that 200 out of 350 students who stay in a hall are late for class, while 390 out of 650 students who do not stay in a hall are late for class. A chi-squared test was done to test for association between staying in a hall and being late for class at 5% level of significance. The p-value derived from the chi-squared test is 0.3809. Which of the following statements is/are true? Select all that apply.
(A) There is a positive association between staying in a hall and being late at the sample level.
(B) There is a negative association between staying in a hall and being late at the sample level.
(C) Since the p-value is more than 0.05, we can conclude that there is an association between staying in a hall and being late at the population level.
(D) Since the p-value is more than 0.05, we cannot conclude that there is an association between staying in a hall and being late at the population level.

(B) and (D) are correct.
To check for association between staying in hall and being late at the sample level, we compare
rate(Late for class | Staying in hall) = 200/350 < 390/650 = rate(Late for class | Not staying in hall).
Hence, we see that staying in hall is negatively associated with being late at the sample level. As we are doing a chi-squared test to see if there is an association at the population level, recall that the null hypothesis states that there is no association at the population level, while the alternative hypothesis states that there is an association at the population level. Since the p-value is greater than the level of significance, we do not reject the null hypothesis. Hence, we cannot conclude that there is an association between staying in hall and being late at the population level.

25. Every day, a food packaging line fills thousands of bags of sugar that are supposed to weigh 100g each. One day, the line supervisor suspects that the machines might be putting more sugar in each bag. He randomly selects some of the bags to perform a hypothesis test. What should his null hypothesis be?
(A) Bags on average, weigh more than 100g.
(B) Bags on average, weigh less than 100g.
(C) Bags on average, weigh 100g.

Answer is (C). The null hypothesis expresses the idea that “things” are behaving as expected. In this case, this means that the bags on average weigh 100g.

26. 25 mothers were each allowed to smell two articles of infant’s clothing. Each of them was then asked to pick the one which belongs to her infant. They were successful in doing so 72% of the time. You want to show that this has not happened by chance and mothers can indeed recognise the smell of their children. To test such a hypothesis, what should the null and alternative hypotheses be?
(A) H0 : P(Success) = 0.5; H1 : P(Success) > 0.5.
(B) H0 : P(Success) = 0.72; H1 : P(Success) > 0.72.
(C) H0 : P(Success) = 0.5; H1 : P(Success) < 0.5.
(D) H0 : P(Success) = 0.72; H1 : P(Success) < 0.72.

Answer is (A).
To show something does not happen by chance, we should have the null hypothesis implying that it happens by chance, and the alternative hypothesis implying that it does not. This is so that we can gather evidence from a probability experiment to reject the null for the alternative. In this case, the null hypothesis is that the chance that a mother is able to recognise her child’s smell correctly is 50%. The alternative hypothesis should be in line with what we want to show, hence the alternative hypothesis is H1 : P(Success) > 0.5.

27. Jane suspects that sleep duration is associated with a child’s mathematical ability. She conducts a study to test her hypothesis. Which of the following should be her null hypothesis?
(A) There is insufficient evidence to conclude sleep duration is associated with a child’s mathematical ability.
(B) Sleep duration is not associated with a child’s mathematical ability.
(C) Sleep duration is associated with a child’s mathematical ability.
(D) There is sufficient evidence to conclude sleep duration is associated with a child’s mathematical ability.

Answer is (B). In hypothesis tests involving the association between two variables, the null hypothesis expresses the idea that a variable is not associated with another variable. Thus in this case, the null hypothesis asserts that sleep duration is not associated with a child’s mathematical ability.

28. A student conducts a one-sample t-test, at a 5% level of significance, in the following way:
Null hypothesis H0 : µ = 20.
Alternative hypothesis H1 : µ > 20.
Which of the following statements is true?
(A) If the level of significance was increased to 10%, the p-value will remain the same.
(B) If the p-value > 0.05, H0 should be rejected.
(C) If H1 was changed to µ < 20, the level of significance will decrease.
(D) If H0 was rejected at a 5% level of significance, H0 will also be rejected at a 1% level of significance for the same data set.

Answer is (A).
The level of significance is the ‘cutoff value’ decided by the researchers, and hence does not impact the p-value in any way, nor should it change when the alternative hypothesis changes. If H0 was rejected at a 5% level of significance, it will also be rejected at a higher level of significance, but not necessarily at a lower level of significance. H0 should not be rejected if p-value ≥ 0.05.

29. Jasmine is a sports psychologist who wants to investigate if participation rates in competitive sports amongst Austrian youths under 12 years old are associated with sports funding to primary schools. She collects her data from a random sample of 16-year-old Austrian students, in the following format:

Subject | Gender | Did you participate in any sports competition before you were 12 years old? (Y/N) | Was there sports funding in your primary school? (Y/N)
001 | M | Yes | No
002 | F | Yes | Yes
··· | ··· | ··· | ···
··· | ··· | ··· | ···

Select the most appropriate null hypothesis in her study.
(A) There is no association between Gender and Participation in sports competition before 12 years old among 16-year-old Austrian students.
(B) There is an association between Gender and Participation in sports competition before 12 years old among 16-year-old Austrian students.
(C) There is no association between Participation before 12 years old and Sports Fundings among 16-year-old Austrian students.
(D) There is an association between Participation before 12 years old and Sports Fundings among 16-year-old Austrian students.

Answer is (C). The first paragraph tells us that the researcher is interested in the relationship between Participation before 12 years old and Sports Fundings. So, (A) and (B) are not suitable null hypotheses. (D) is the alternative hypothesis for this study.

30. Professor Ali, a sleep researcher, claims that the average amount of sleep that students in school XYZ get each day is 7 hours.
Professor Lily believes that it is less than that, as her students are always sleeping in her classes. She decided to conduct a hypothesis test at the 5% significance level. She randomly sampled 153 students from the school and obtained a sample average sleep duration of 5.5 hours. Which of the following outcomes must Professor Lily consider in calculating the p-value for her hypothesis test?
(A) Sample average sleep duration of 7 hours.
(B) Sample average sleep duration smaller than 7 hours.
(C) Sample average sleep duration greater than 5.5 hours.
(D) Sample average sleep duration smaller than 5.5 hours.

Answer is (D). The null hypothesis here is that the population mean sleep duration for students in school XYZ is 7 hours. The alternative hypothesis here is that the population mean sleep duration for students in school XYZ is less than 7 hours. From the sample, the test result observed is the sample average sleep duration of 5.5 hours. Hence, test results that are as extreme as, or more extreme than, what was observed are sample averages of sleep duration that are smaller than or equal to 5.5 hours obtained from random samples drawn.
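
As a numeric check on the “at least as extreme” reasoning used in questions 23, 24 and 26, the following is a minimal Python sketch using only the standard library. The function names binomial_upper_tail and chi_squared_2x2 are illustrative and not part of the original tutorial. The chi-squared calculation assumes that no continuity correction is applied, which question 24 does not state, and it relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_upper_tail(n, observed, p_null=0.5):
    """Return P(X >= observed) for X ~ Binomial(n, p_null).

    This is the one-sided p-value when larger X favours the alternative.
    """
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(observed, n + 1)
    )


def chi_squared_2x2(table):
    """Return (statistic, p_value) for a 2x2 table of observed counts.

    Assumption: Pearson chi-squared test without continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of no association.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom: P(chi2 > s) = 2 * (1 - Phi(sqrt(s))).
    p_value = 2 * (1 - NormalDist().cdf(sqrt(statistic)))
    return statistic, p_value


# Question 23: 10 tosses, 4 heads observed, H1: P(H) > 0.5,
# so the p-value covers 4 <= X <= 10.
p_q23 = binomial_upper_tail(10, 4)

# Question 26: 72% of 25 mothers is 18 successes, H1: P(Success) > 0.5.
p_q26 = binomial_upper_tail(25, 18)

# Question 24: rows are hall / non-hall, columns are late / not late.
stat_q24, p_q24 = chi_squared_2x2([[200, 150], [390, 260]])

print(f"Q23 p-value: {p_q23:.4f}")
print(f"Q26 p-value: {p_q26:.4f}")
print(f"Q24 chi-squared: {stat_q24:.4f}, p-value: {p_q24:.4f}")
```

For question 23 the p-value works out to 848/1024, or about 0.83, so observing 4 heads gives no evidence of a bias towards heads. For question 24 the sketch should give a p-value close to the 0.3809 quoted in the question.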