Uploaded by tina rams

gea1000 practice questions

advertisement
1. Drug A is a new drug created to treat Congenital Amegakaryocytic Thrombocytopenia
(CAMT). An experiment was done to evaluate survival outcomes of drug A, against the
current standard, Rituximab, for treating CAMT. 150 CAMT patients were assigned
to the drug A group, and 150 CAMT patients were assigned to the Rituximab group.
A portion of the experimental design is summarised in the table below.
Male
Female
Total
Drug A
82
68
150
Rituximab
85
65
150
Determine if the statement below is true or false:
“The table shows that random assignment was not done, because the two groups (drug
A and Rituximab) do not have the same number of females.”
The statement is false. Random assignment does not guarantee that the number of
females will be exactly the same in both groups (drug A and Rituximab). If random
assignment was done on a large number of subjects, the two groups will tend to be
similar (not necessarily the same) in all aspeccts.
2. In a study on the relationship between watching television and obesity, 3000 individuals were recruited via an advertisement put up by ABC newspaper in Singapore.
Afterwards, the investigator of the study separated the participants into two groups.
The participants in one group consists of people who, on average, watch television for
at least 4 hours a day, while the other group consists of people who, on average watch
television strictly less than 4 hours a day. The investigator then records how many
participants are obese in each group. Which of the following is/are true?
(I) The result of this study is generalisable to the population who reads ABC newspaper.
(II) This is an observational study.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (B). The result is not generalisable because the investigator uses nonprobability sampling (volunteer sampling). The study is clearly an observational study
because the researchers are not involved in the assignment of the subjects to either
group, but rather is done by self assignment.
3. In a large scale experiment, a researcher randomly assigned 6000 subjects to receive
either a drug or a placebo. 4000 patients were assigned to receive the drug, and the
other 2000 patients received the placebo. The researcher did a quick headcount in the
drug-receiving group and noted that there were 3002 males who received the drug.
The researcher does not have time to do a headcount in the placebo group. Which is
the most reasonable number of males to be expected in the placebo group?
(A) 1000.
(B) 1500.
(C) 2000.
(D) 2998.
1
Answer is (B). Randomised assignment of a large number of subjects tend to produce
groups which are similar in all aspects (including the proportion of males in each
group). 6000 is a reasonably large number, so we would expect the proportion of
males and females in each group to be similar. Among the 4000 subjects who received
the drug, 3002 (about 75%) were males. Hence among the 2000 patients who received
the placebo, about 75% (1500) of them should be male.
4. Virus X has been known to cause very severe symptoms in its patients. Previously
there has been no anti-viral medicine to treat virus X. Recently, researchers have
finally managed to produce a trial drug in the form of a tablet. Researchers want
to investigate if the trial drug helps to reduce the duration of symptoms (number of
days) in patients. 1000 patients were sampled for the study, and all consented to join
the study.
Which of the following statements is/are true? Select all that apply.
(A) Random sampling should be done to ensure that the subjects’ demographics/characteristics are similar (in the treatment and control groups).
(B) Blinding the researchers to the subjects’ assigned groups (treatment or control
group) is important because the researchers may have certain bias for/against
the drug.
(C) If the study randomly assigns 400 subjects into the treatment group, and 600
subjects into the control group, the result of the study will be biased due to the
unequal number in the two groups.
Only (B) is correct. Random assignment (not random sampling) should be done to ensure that the subjects’ demographics/characteristics are similar in the treatment and
control groups. Furthermore, random assignment does not require the same number
of subjects in both treatment and control groups. As long as the number of subjects
is large, the treatment and control groups will likely have similar demographics/characteristics. In fact, since we are comparing rates and not numbers, it does not matter
if we have unequal group sizes.
5. A researcher has invited 500 people to participate in his study. He uses random assignment to assign the subjects into the Treatment and Control groups. The Treatment
group has 200 subjects, and the demographics of the 200 subjects are as follows:
Old
Young
Male
83
32
Female
18
67
The 300 remaining subjects are in the Control group. The researcher should expect
.
the number of young males in the Control group to be around
(A) 32.
(B) 48.
(C) 51.
(D) 173.
Answer is (B). When random assignment is conducted, we can expect the percentage
of males to be similar in both Treatment and Control grpups. We can also expect the
percentage of young people to be similar in both Treatment and Control groups. The
32
proportion of young males in the Treatment group is 200
= 0.16. We can expect a
similar proportion in the Control group, implying that we expect about 0.16×300 = 48
young males in the Control group.
2
6. Xiao Lian is a frequent TikTok user, and she wants to compare the effect of watching
TikTok dance videos versus watching maths videos on the happiness levels of all NUS
students.
She video calls 200 of her closest NUS friends to take part in her study, but only 100
of her friends responded. Half of these 100 friends were chosen at random to watch
a TikTok dance video while the other half were shown a maths video, and she asks
them to rate their happiness levels before and after watching the videos.
Which of the following statements is true?
(A) This is an observational study.
(B) This study is likely to contain selection bias.
(C) If all 200 of her friends responded, the study’s findings are generalisable to all
NUS students.
(D) Random assignment was conducted, so the study does not contain selection bias.
Answer is (B). This study is likely to contain both selection bias and non-response
bias. It is likely to have selection bias because Xiao Lian chose from only her friends,
and not the entire NUS population of students. For example, Xiao Lian might have
more friends in the same faculty as her as compared to other faculties, or they might
have the same interests (TikTok) as her. The study is likely to have non-response
bias because 100 of them did not respond. Random assignment is a technique used
to remove the effects of potential confounding variables. Random assignment will not
remove selection bias in a study.
7. A study was conducted to investigate if memory foam pillows helped to improve one’s
sleep quality. A large number of subjects were sampled via simple random sampling.
The study aimed to randomly assign the subjects into the treatment and control
groups. Thus, a fair coin was flipped for each subject to determine if the subject
should be assigned into the treatment or control group – if “heads”, the subject is
assigned into the treatment group, if “tails”, the subject is assigned into the control
group. Subjects in the treatment group received a memory foam pillow, while subjects
in the control group received a regular pillow. Below is the description of the number
of subjects between the two groups.
Sex
Male
Female
Total
Treatment
q
t
w
Control
r
u
x
Total
s
v
y
From the above table, which of the following statements is true?
(A) wq should be approximately equal to xr .
(B) q must be equal to r.
(C) s should be approximately equal to v.
(D) w must be equal to x.
Answer is (A) because we expect that if random assignment was done and the coin
is fair, we would see a similar proportion (may not be the exact proportion) of males
between both treatment and control groups. None of the other statements results from
random assignment using a fair coin or from simple random sampling.
8. Which of the following statements is true regarding observational studies and experimental studies?
3
(A) Observational studies involve manipulating variables to establish cause-and-effect
relationships.
(B) Experimental studies are only used to study rare events or long-term trends.
(C) Observational studies rely on random assignment to groups to control for confounding variables.
(D) Observational studies can establish associations.
Answer is (D). We are not able to manipulate variables in observational studies. The
use of experimental studies is not restricted to the study of rare events nor long-term
trends. Random assignment is not feasible in observational studies.
9. Patch Z is a new medicine created to remove muscle soreness. A study was done to
investigate the effectiveness of patch Z. The population of interest was Singaporean
adult males. For this study, the researchers requested the Singapore Sports Association
to sample all male athletes who reported for training over the week. 200 male athletes
were sampled. There was no non-response. The 200 subjects had their identity tags
randomly shuffled in a box. The first 100 tags picked from the box were assigned to
the treatment group - administered patch Z. The remaining 100 were administered a
placebo. 72% of the group that received patch Z had their muscle soreness alleviated,
while 34% of the other group had their soreness alleviated. Which of the following
is/are true?
(I) We are not able to generalise these results to the population of interest.
(II) Random assignment was not conducted.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (A). The result cannot be generalised, because of the sampling method
and the fact that the sampling frame does not contain the population of interest.
The method is non-probability sampling, and the sampling frame contains only male
athletes, or those who went for training that week, but the population of interest is all
Singaporean adult males. Take note that random assignment was actually conducted.
10. A researcher is trying to study the happiness level of all current NUS students. Which
of the following is/are (an) example(s) of probability sampling methods?
(I) The researcher gets a list of all current NUS students’ emails from the administrative office and randomly selects 100 students’ emails. He then emails them a
link to a short e-survey. 50 students replied to his survey.
(II) The researcher invites all final year NUS students in his faculty to visit his lab
for a psychological test to determine their happiness level, with a promise to
compensate them for their time with a $10 voucher. 200 students turned up for
the test.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
4
Answer is (A). Randomly selecting 100 emails from the list of student emails is a
probability sampling method, even if the response rate is not high. On the other
hand, the process of sending an email to only final year students in the researchers’
faculty is done out of convenience, and convenience sampling is a non-probability
sampling method.
11. To find out the employment status of fresh graduates from University ABC, a questionnaire was sent to all of them. 30% of fresh graduates responded to the survey.
The employment rate was calculated from the responses. Which of the following is
likely to cause the calculated rate to differ from the population rate?
(I) Selection bias.
(II) Non-response bias.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (B). 30% of fresh graduates may not be representative of the whole population. It is possible that those who did not respond may have different employment
status from those who responded. Although the response rate is low, it is still a census.
There is no selection bias in a census.
12. Suppose I wish to find the average intelligence quotient (IQ) of all Primary 5 children
studying in local schools in Singapore. I first selected a random sample of 10 schools
out of all local primary schools in Singapore. Then I asked all the Primary 5 children
in these chosen 10 schools to take an IQ test. Finally, I obtained the average value of
all the IQ scores of children who took the test, which was 106. Which of the following
statements is/are correct?
(I) The parameter in this study is the average IQ of all Primary 5 children who took
the IQ test.
(II) Stratified sampling was employed in this study.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (D). The parameter in this study is the average IQ of all Primary 5 children
studying in local schools in Singapore. 106 is a sample estimate of the actual
parameter. Hence statement (I) is incorrect. In stratified sampling, the population is
divided into groups (strata) and then we randomly obtain a sample from each group.
In cluster sampling, the population is first divided into groups (clusters). Then we
take a random selection of clusters from all clusters, and include all units in the chosen
clusters to comprise our sample. Here, cluster sampling is employed, where each school
is a cluster. Hence statement (II) is also incorrect.
13. An airline would like to find out the quality of service provided by their staff on one
of its flights. The airline posted an interviewer right after the plane landed and told
the interviewer to start her interview when the third passenger came out from the
plane. Thereafter, she would interview the next customer who came out from the
5
plane each time she had finished interviewing the previous customer. What is the
sampling method employed by the interviewer?
(A) Systematic sampling.
(B) Simple random sampling.
(C) Cluster sampling.
(D) None of the other given options.
Answer is (D). This sampling method is not a probability sampling method since there
is no probability involved in selecting passengers to be interviewed. All the methods
given in the other options are probability sampling methods.
14. In a drug factory, pills were manufactured in 1000 batches, with 20 units per batch,
forming a total of 20000 units. You decide to sample some of these pills to ensure that
the dosage is right. Suppose you randomly sample two batches and then select every
unit in these batches to be in your sample. What sampling method did you use?
(A) Systematic sampling.
(B) Stratified sampling.
(C) Cluster sampling.
(D) Simple random sampling.
Answer is (C). Since every unit from the randomly selected batches (the clusters) are
included in the sample, this is an example of the cluster sampling method.
15. Assume you have a sampling frame of your entire population of interest, which comprises of 100 people’s names. Which of the following methods can be used to select a
simple random sample of 10 people from this population?
(I) Assign, without replacement, to each person a random number from 1 to 100 such
that none of them share the same number. Choose the people assigned numbers
1 to 10.
(II) Write the names on equal sized pieces of paper, put the papers in a hat. Shake
the hat, mix the papers well and draw out 10 names.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (C). In simple random sampling, every unit in the population of interest
must have the same chance of being selected in the sample.
16. Caleb resides in Stamford Hall, which comprises 400 university students. He wishes
to understand the association between Cumulative Average Point (CAP) and sleeping
hours among students residing there. There are a total of 100 rooms, with rooms
numbered 1 to 100, and each having 4 resident students. Rooms 1 to 50 contain male
residents, while rooms 51 to 100 contain female residents. Caleb picks all rooms that
are multiples of 5 (i.e., rooms 5, 10, 15, . . ., to 100), and all students in the selected
rooms were asked to do the study.
Which of the following must be true?
(I) The above is an example of systematic sampling.
6
(II) The response rate for the study is 100%.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (D). It is not stated explicitly that the selection of the rooms is done via a
randomised mechanism. Therefore, the sampling process is non-probability in nature.
Also, it is not stated explicitly if all who were asked to do the study will respond.
Therefore, the response rate for the study may not necessarily be 100%.
17. A researcher wants to know the average weight of year 1 students in University A. The
researcher does not have access to such information, hence he decided to do a survey.
All University A year 1 students have to take a compulsory module in the first semester
of their studies, and hence have to be present for an in-person examination on 24th
April at 1pm. The researcher stood outside the examination venue’s only exit with a
weighing scale and waited for the examination to end.
There were too many students for the researcher to weigh. Hence, to decide whom
to weigh, while students were exiting, the researcher used a random integer generator
to produce a random integer for each student. If the random integer was even, the
researcher will measure the student’s weight. If the random integer was odd, the researcher will not measure the student’s weight. Assume that all students exited the
venue orderly in a line and were compliant with the researcher. There were 800 students exiting the venue, and the random integer generator produced 200 even numbers
and thus 200 students were weighed.
What is an/are issue(s) that the study is likely to face? Select all that apply.
(A) Bias present due to the non-probability sampling method.
(B) Bias present due to the random integer generator producing unlikely random
numbers.
(C) Bias present due to a poor sampling frame chosen.
(D) None of the other options.
Answer is (D). The study uses a form of probability sampling, where every unit has a
non-zero and known probability of being selected. Also, the random integer generator
is not always expected to produce 50% odd numbers and 50% even numbers, as it
depends on the settings. The population of interest matches the sampling frame for
this situation - the year 1 students.
18. A class of 150 men and 250 women is seated in an examination hall that has 25 rows
of chairs, with 16 chairs in each row. The men are seated in chairs numbered 1 to
150, and the women are seated in chairs numbered 151 to 400. Which of the following
scenarios will NOT produce a simple random sample of students from this class of
400? Select all that apply.
(A) Choose a row at random and select all the students in that row.
(B) Use a random number generator to generate 10 integers from 1 to 400. Select the
students seated in chairs corresponding to these numbers.
(C) Use a random number generator to generate 10 integers from 1 to 150, and 10
integers from 151 to 400. Select the students seated in chairs corresponding to
these numbers.
7
(D) Randomly select a letter from the English alphabet. Select for the sample students whose family names begin with that letter. If no family name begins with
that letter, randomly choose another letter from the alphabet.
(A), (C) and (D) will not produce a simple random sample of students. Recall that
a SRS is a sample such that any sample of size n is equally likely to be chosen. (A)
does not produce a SRS, since not every group of size 16 has the same chance of being
selected. For example, students from different rows have no chance of being selected
together. (B) produces a SRS, since every group of size 10 has the same chance of
being selected. (C) does not produce a SRS, since not every group of size 20 has the
same chance of being selected. For example, a group of 20 women cannot all be in the
sample. This is actually a stratified sample, with sex as strata. (D) does not produce
a SRS, since any two students with family names starting with different letters cannot
both be in the sample.
19. Which of the following scenarios involve(s) the use of probability in the sampling
process?
Select all that apply.
(A) A student wants to know how receptive bus commuters in Singapore are to the
recent announcement of a price increase. To obtain a sample, he went to a
nearby bus interchange and looked for commuters wearing white clothing (his
favourite colour). For each commuter wearing white, if the commuter was wearing
spectacles, he approached the commuter. Otherwise, he did not approach the
commuter. We may assume that every commuter that he approached completed
the survey.
(B) An event organiser wants to survey a sample of the participants of his event to find
out if they liked the activities he planned. He announced to all the participants
that anyone who completed the survey would win a prize with a probability of
0.8. The sample was made up of 75 participants who responded to the survey.
(C) The principal of a primary school wants to select a sample of his primary one
students to find out how they are coping with formal school education. There are
10 primary one classes and each class has 42 students. The principal went into
each class and rolled a six-sided die. If the die showed k (k = 1, 2, 3, 4, 5 or 6), then
students in the class with register numbers k, k + 6, k + 12, k + 18, k + 24, k + 30
and k + 36 would be included in the sample. A final sample of 70 students, 7
from each class, was formed.
(D) A researcher is trying to study how much time students from the Faculty of
Science sleep every day. He obtained a list of all Science students from the administration and for each student on the list, he generated a random number from
1 to 10. If the number generated was less than 7, the student was not selected.
If the number was 7 or more, the student was selected, and the researcher sent
an email to the student with a few survey questions. A total of 700 students
received the survey email and 500 of them replied.
(C) and (D) are correct as both employ a randomised mechanism in the selection
process. (B) is non-probability sampling (volunteer sampling) while (A) is also nonprobability sampling as the student conveniently went to a nearby bus interchange
(convenient sampling).
20. The Registry of Marriages is interested to see the relationship between the ages of
husbands and wives in City X. They randomly sampled 1000 pairs of husbands and
wives from the population of City X and obtained data of their ages (in years). Looking
through the data, they found that men always marry women who are younger than
8
them.
Based only on the information given above, which of the following statements must
be true?
(I) The average age of the husbands is more than the average age of the wives.
(II) The standard deviation of husband’s age is more than the standard deviation of
wife’s age.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (A). We cannot tell anything about the spread of either the husbands’ ages
or the wives’ ages just by knowing that men are marrying women younger than them.
If all husbands are older than their wives, then it follows that the average age of the
husbands is going to be more than the average age of the wives.
21. A teacher has just finished marking the final examination scripts for her class of 50
Secondary 1 students. She informs the students that the class average is 67.3. The
maximum mark for the examination is 100 and the passing mark is 50. A student
receives his examination script and realises his score is 65 which is lower than the
average score. Based only on the information given above, which of the following
statements must be true?
(I) The student has performed worse than half the class.
(II) Everyone in the class has passed the test.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (C). A score lower than the mean does not imply that the student has
performed worse than half the class. Consider the following set of scores for 10 students
45, 55, 60, 62, 64, 64, 65, 85, 86, 87.
The average is 67.3. The student who has scored 65 clearly has not performed worse
than half the class. Neither has everyone in the class passed the test since one student
has scored 45. A similar data set can also be constructed for 50 students. Therefore,
neither statement is true.
22. We have learnt that the standard deviation and interquartile range (IQR) are examples
of summary statistics that help us to quantify the spread of data points. However, they
are not the only ways of quantifying spread and there are other summary statistics
that can also help us to do this. For a numerical variable x, we can define the Mean
Absolute Deviation (commonly abbreviated as MAD) using the formula
|x1 − x| + |x2 − x| + · · · + |xn − x|
,
n
where x1 , x2 , . . . , xn are values for the variable in a data set and n is the number of
data points in the data set. The MAD is sometimes used in place of the standard
deviation as a measure of quantifying the spread of the data. Based on the above
formula, which properties must the MAD possess? Select all that apply.
Mean Absolute Deviation of x =
9
(A) The MAD cannot take a negative value.
(B) The MAD does not change when a constant is added to all the data points.
(C) The MAD does not change when a constant is multiplied to all the data points.
(D) If the MAD is zero, then all the values of x1 , x2 , . . . , xn in the data set are the
same.
(A), (B) and (D) are correct. Based on the formula above, the MAD behaves very
similarly to the standard deviation. Since we are taking absolute values, the MAD
can never be negative.
If a constant is added to all the data points, then the constant is also added to the
mean of the new data therefore the absolute difference between each point and the
mean continues to remain the same.
When we multiply a constant to all the data points, the mean is also multiplied by
the same constant, therefore the difference between each point and the mean is also
multiplied by the same constant. Hence, the MAD is multiplied by the constant. The
constant can be numbers other than 1 or −1, so the MAD can change.
Finally, if the MAD of x is zero, it means
|x1 − x| + |x2 − x| + · · · + |xn − x|
= 0,
n
which means
|x1 − x| + |x2 − x| + · · · + |xn − x| = 0.
Since the absolute value of any number cannot be negative, we are adding numbers
which cannot be negative to give us 0. Therefore, each number can only be zero which
means every data point is equal to the mean.
23. Suppose X is a numerical variable and the following are 10 data points for this variable.
4, 7, 4, 14, 10, 11, 17, 3, 8, r,
where r is a positive whole number that is unknown. Which of the following statements
is/are always correct? Select all that apply.
(A) If the mean is greater than 8 then r must be greater than 2.
(B) If r is greater than 2, then the median must be greater than 8.
(C) The mean is always greater than the median regardless of the value of r.
(D) The mode is always greater than the median regardless of the value of r.
Only (A) is correct. The sum of the 10 data points is
4 + 7 + 4 + 14 + 10 + 11 + 17 + 3 + 8 + r = 78 + r.
If the mean is greater than 8, then 78 + r must be greater than 80, which implies r
must be greater than 2. Arranging the 9 numbers excluding r in increasing order, we
have
3, 4, 4, 7, 8, 10, 11, 14, 17.
Note that the median is a number m where 50% of the numbers are smaller than m.
For example, if r = 3, the median is 7.5 thus (B) is incorrect. From above, since r is
at least 1, the mean is at least 7.9. But if r = 10 for example, then the median is 9
which is higher than the mean. So (C) is incorrect. (D) is also incorrect since if r = 4,
the mode is 4 while the median is 7.5.
10
24. Consider a data set consisting of values for a numerical variable x. Let the values
be x1 , x2 , . . . , xn arranged in ascending order. A value y is said to be the balancing
point of x in the data set if the following condition is satisfied.
(y − x1 ) + (y − x2 ) + · · · + (y − xk ) = (xk+1 − y) + (xk+2 − y) + · · · + (xn − y)
where x1 , x2 , . . . , xk are the values of x in the data set that are smaller than or
equal to y and xk+1 , xk+2 , . . . , xn are the values of x in the data set that are larger
than y. For example consider a small data set {1, 3, 5, 5, 5, 7, 9}. In this case the value
5 is the balancing point of the data set since
(5 − 5) + (5 − 5) + (5 − 5) + (5 − 3) + (5 − 1) = (7 − 5) + (9 − 5).
Which of the two statements below is/are true?
(I) The median of x is always the balancing point of x in any data set.
(II) The mode of x is always the balancing point of x in any data set.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (D). Neither median nor mode is always the balancing point of a data set.
Consider a small data set {1, 1, 3, 4, 5}. The median of the data set is 3. However, we
observe that (3 − 1) + (3 − 1) is not the same as (4 − 3) + (5 − 3). Similarly, the mode is
1 but once again we see that (1−1)+(1−1) is not the same as (3−1)+(4−1)+(5−1).
25. City planners wanted to know how many people lived in a typical housing unit so they
compiled data from hundreds of forms that had been submitted in various city offices.
Summary statistics are shown in the table below.
Mean
2.53
Standard Deviation
1.4
Min
1
Q1
1
Median
2
Q3
3
Max
8
The city bases their garbage disposal fee on the occupancy level of the home or apartment. The annual fee is $50 plus $4 per person, so a single-occupant home pays $54
and homes with 10 people pay $50 + $4 ×10 =$90 a year.
The median fee paid is
(2)
(1)
and the IQR of the fee paid is
.
Fill in the blanks for the statement above, give your answers correct to 2 decimal
places.
The answer to the first blank is 50 + 2 × 4 = 58 since the median number of occupants
is 2. The IQR of the fee paid is (50 + 3 × 4) − (50 + 1 × 4) = 8.
26. An examination was given to Class A and Class B, which consisted of 20 students each.
The minimum possible score for the examination is 0 and the maximum possible score
is 100. The range of scores in Class A is from 75 to 95 marks. All the students in
Class B scored less than or equal to 50 marks. Due to shortage of teachers, Class A
and Class B were combined to form Class C.
Consider the following statements:
(I) The median score in Class C must be greater than the median score in Class B.
11
(II) The interquartile range of scores in Class C must be greater than the interquartile
range of scores in Class A.
Which of the statements is/are true?
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (C). Since Class A and Class B have the same number of students, and all
students in Class A scored strictly greater than the maximum score of Class B, the
median for Class C will be greater than the maximum score of Class B. Hence the
median of Class C will be greater than the median of Class B. The largest possible
interquartile range for Class A is 95 − 75 = 20. The smallest possible interquartile
range for Class C is 75 − 50 = 25. Hence the interquartile range for Class C must be
greater than the interquartile range for Class A.
27. Which of the following statements is true?
(A) The 3rd quartile is a measure of central tendency.
(B) The 2nd quartile is a measure of central tendency.
(C) The 3rd quartile is a measure of dispersion.
(D) The 2nd quartile is a measure of dispersion.
Answer is (B). The 2nd quartile is the median, which is a measure of central tendency.
The 3rd quartile is neither a measure of central tendency nor a measure of dispersion.
28. A multiple-choice mid-term examination was conducted for 2000 students in a General
Education Module GEB1000. There were 20 questions. Students were awarded 1 mark
for each correct answer and received 0 mark for any wrong answer. There was no
partial credit awarded for all questions. A teaching assistant helped with the collation
of the scores of the paper, and provided the following summary statistics:
ˆ Minimum = 2.0.
ˆ 1st quartile = 7.5.
ˆ Median = 11.5.
ˆ Mean = 9.0.
ˆ Mode = 12.0.
ˆ 3rd quartile = 13.2.
ˆ Maximum = 20.0.
Which of the following statements is/are true? Select all that apply.
(A) The 3rd quartile is incorrect.
(B) Based on the above information, we can conclude that the range is 18.0.
(C) Based on the above information, we can conclude that the coefficient of variation
is 2.
(D) None of the other statements is true.
12
(A) and (B) are correct. (A) is correct since the score of the exam is between 0 and
20 inclusive, with no partial credit given for any question, it is not possible for the 3rd
quartile to be 13.2. It is worth noting though, that it is possible for the 1st quartile,
median or 3rd quartile to be of the form x + 0.5, where x is an integer. For example,
consider the following values for simplicity: 1, 2, 3, 4, 5, 7, 12, 13, 15. To derive the
3rd quartile, we look at the values above the median. That is, 7, 12, 13 and 15. The
median of these four values is the average of 12 and 13, which is 12.5. (B) is correct
as the range is the difference between the minimum and maximum values, which is
20.0 − 2.0 = 18.0. (C) is incorrect. The coefficient of variation (CV) is Standard
Deviation divided by Mean. Since we do not have any information about the standard
deviation, we are unable to determine the value of CV.
29. There are 20 students in each of the groups: A and B. All the students in both groups
took a test. The minimum possible score is 0 and the maximum attainable score is
50. The range of scores obtained from the students in group A was 28 to 40. All
the students in group B scored at least 42 marks. Let M be the average value of the
combined scores of all the 40 students. Which of the following statements must be
true?
(A) M is equal to the average score in group B.
(B) M is higher than the average score in group A.
(C) M is lower than the average score in group A.
(D) There is insufficient information to deduce the relationship between M and the
average score in group A or group B.
Answer is (B). Since group A and group B have the same number of students, and
all the students in group A scored strictly lower than the lowest score in group B, it
implies that the average for group A is strictly lower than that for group B. Therefore,
when the scores of both groups are combined, the overall average score will be the
‘sum of the average score in group A and the average score in group B’ divided by
2, which will be strictly greater than the average score in group A and strictly lower
than the average score in group B.
30. A study was done to determine the relationship between perceived obstetric violence
and the risk of postpartum depression (PPD). A total of 782 women were asked to
report on their baseline characteristics including the following: Age, Education level
(Secondary School/ Pre-university / University), Family monthly wage (Euros), and
Nationality (Spanish / Portuguese / French / Others). Which of the following statements is true about the types of variables of Age, Education level, Family monthly
wage and Nationality, respectively?
(A) Numerical, Categorical ordinal, Numerical, Categorical ordinal
(B) Numerical, Categorical ordinal, Numerical, Categorical nominal
(C) Numerical, Categorical nominal, Numerical, Categorical ordinal
(D) Categorical ordinal, Categorical nominal, Categorical ordinal, Categorical nominal
Answer is (B). Education level and Nationality are categorical variables. There is some
natural ordering for Education level but not for Nationality. Therefore, Education level
is categorical ordinal and Nationality is categorical nominal. Age and Family monthly
wage are numerical variables.
13
1. The figure below shows that out of every 9 Singaporeans, 1 of them has diabetes.
Similarly, out of 10 Singaporeans over 60, 3 of them have diabetes. Let us define “over
60” as old and “60 and below” as young. Which of the following statements is/are
true? Select all that apply.
(A) rate(Diabetes | Young) > rate(Diabetes | Old).
(B) rate(Young | Diabetes) < rate(Young | No diabetes).
(C) rate(Old | Diabetes) < rate(Old | No diabetes).
(D) rate(Diabetes | Young) < rate(Diabetes | Old).
(B) and (D) are correct. From the above, we can see that the overall rate of diabetes
is 1/9 = 0.111. Additionally, we also know that rate(Diabetes | Old) = 0.3. Using
the basic rule of rates, we can deduce that rate(Diabetes | Young) must be less than
0.111, and by extension, less that rate(Diabetes | Old).
Since rate(Diabetes | Young) < rate(Diabetes | Old), it must also mean that rate(Young
| Diabetes) < rate(Young | No diabetes).
Since rate(Diabetes | Old) > rate(Diabetes | Young), we can also conclude that
rate(Old | Diabetes) > rate(Old | No diabetes).
2. In a survey conducted at a primary school, where 1000 students responded, 55% of
the respondents indicated that they sleep more than 8 hours every day. Among those
who sleep more than 8 hours every day, 70% wear spectacles. We also know that 50%
of the respondents wear spectacles. The percentage (computed to 3 significant figures)
of respondents who wear spectacles among those who sleep at most 8 hours every day
.
is
By the basic rule of rates, since 50% of all respondents wear spectacles, and 70%
among those who sleep more than 8 hours every day wear spectacles, the percentage
among those who sleep at most 8 hours every day must be less than 50%. In fact,
this percentage can be calculated exactly. Consider the following contingency table.
Figures that can be derived from the question are inserted. The rest of the table can
be completed in the order (1), (2), (3).
1
Sleep more than 8 hours/day
Sleep at most 8 hours/day
Column total
Wear spectacles
0.7 ∗ 550 = 385
(2) 500 − 385 = 115
500
Do not wear spectacles
(1) 550 − 385 = 165
(3) 450 − 115 = 335
500
Row total
550
450
1000
Among those that sleep at most 8 hours every day, the percentage of respondents who
wears spectacles is 115/450 ∗ 100% = 25.6%.
3. The contingency table below shows the income level for a group of adults classified by
gender.
Gender
Male
Female
Total
Low
23
26
49
Income level
Middle High
30
29
28
20
58
49
Total
82
74
156
The marginal rate, rate(Middle income), is calculated to be
(1)
the joint rate, rate(Female and Low income), is calculated to be
Give each answer as a percentage correct to 2 decimal places.
(2)
%; while
%.
To calculate the marginal rate, rate(Middle income), we take the column total of
all Middle income persons divided by the grand total of everyone in the sample, i.e.
58/156 ≈ 37.18%. Thus the answer to (1) is 37.18.
Then, to calculate the joint rate, rate(Female and Low income), we take the count of
“Females with Low income” divided by once again the grand total of everyone in the
sample, i.e. 26/156 ≈ 16.67%. Thus the answer to (2) is 16.67.
4. An experiment was done to evaluate the survival outcomes of a new drug X, against the
current standard Imatinib, for treating Chronic Myeloid Leukemia (CML). 459 CML
patients were assigned to two groups. One group was put on treatment X and the
other on Imatinib treatment, over a course of 36 months. A portion of the experiment
results is summarised in the table below.
Male
Female
Total
X
Patients Deaths
100
6
125
17
225
23
Imatinib
Patients Deaths
104
12
130
15
234
27
Which of the following statements is/are true? Select all that apply.
(A) The assignment of patients to the two groups could not have been random because
the two groups do not have the same number of patients.
(B) Being male is positively associated with the survival from CML.
(C) Treatment X (compared to Imatinib) is positively associated with survival from
CML.
(D) Sex is a confounder in the relationship between treatment type and survival from
CML.
(B) and (C) are correct. Random assignment does not guarantee that the number of
patients will be exactly the same in the different treatment groups.
rate(Survival | Male) = 100+104−12−6
= 186
100+104
204 = 0.91.
rate(Survival | Female) = 125+130−17−15
= 223
125+130
255 = 0.87.
2
Since rate(Survival | Male) is greater than rate(Survival | Female), being male is
positively associated with survival.
= 202
rate(Survival | X) = 225−23
225
225 = 0.90.
= 0.88.
rate(Survival | Imatinib) = 234−27
234
Since rate(Survival | X) is greater than rate(Survival | Imatinib), treatment X is positively associated with survival.
4
104
rate(Male | X) = 100
225 = 9 = 234 = rate(Male | Imatinib).
So sex and treatment type are not associated. As sex is not associated with treatment type, it cannot be a confounder in the relationship between treatment type and
survival. Recall that a confounder must be associated with both the independent and
dependent variables.
5. Consider the following figure, which depicts the proportion of working husbands and
wives in Singapore. Suppose that couples with BOTH husband and wife working
are considered as “Dual-career” whereas all other scenarios are considered as “Nondual-career”.
Which of the following statements is/are true amongst married couples in the years
2010 and 2015? Select all that apply.
(A) The year 2015 is positively associated with dual-career couples.
(B) The year 2015 is positively associated with non-dual-career couples.
(C) rate(Dual-career | 2015) > rate(Dual-career | 2010).
(D) rate(Non-dual-career | 2010) > rate(Non-dual-career | 2015).
(A), (C) and (D) are correct. Note that
rate(Dual-career | 2015) = 53.8% > 47.1% = rate(Dual-career | 2010).
So the year 2015 is positively associated with dual-career couples. This also implies
that the year 2015 cannot be positively associated with non-dual-career couples as
event B cannot be both positively associated with event A as well as NA. Finally,
rate(Non-dual-career | 2015) = 27.7 + 6 + 12.4 = 46.1% < 52.9% = rate(Non-dual-career | 2010).
6. In this question, we are focussed only on the final year students in AMAS university
in a certain year. You are given 75% of all these students graduated, and that 60%
graduated among the female students. Which of the following statements must be
true? Select all that are true.
3
(A) The percentage of graduates among the female students is less than the percentage of graduates among the male students.
(B) The percentage of female students among the non-graduates is greater than the
percentage of female students among the graduates.
(C) The percentage of male students among the graduates is less than the percentage
of male students among the non-graduates.
(D) The percentage of male students is greater than the percentage of male students
among the non-graduates.
(A), (B) and (D) are correct. Let G represent graduates, N G represent non-graduates,
M represent males, F represent females. Then we are given that rate(G) = 75% and
rate(G | F ) = 60%. By the basic rule on rates, since
rate(G) > rate(G | F ),
we must have
rate(G | M ) > rate(G) > rate(G | F ).
Hence the percentage of graduates among female students is less than the percentage
of graduates among male students. By symmetry rule, we have
rate(F | G) < rate(F | N G),
and so the percentage of female students among the non-graduates is greater than the
percentage of female students among the graduates. By symmetry rule, we also have
rate(M | G) > rate(M | N G),
and so it is incorrect to say that the percentage of male students among the graduates
is less than the percentage of males students among the non-graduates. Finally, by
the basic rule on rates, we must have
rate(M | G) > rate(M ) > rate(M | N G).
Thus the percentage of male students is greater than the percentage of male students
among the non-graduates.
7. Joseph conducted a study on night owls (individuals who sleep after 12am on average
every night) among staff and students in NUS. He gathered the following result on
rates:
rate(Night owl | Student) = 0.7;
rate(Non-Night owl | Staff) = 0.4.
Which of the following statements is correct?
(A) rate(Night owl) must be between 0.4 and 0.7.
(B) rate(Night owl) must be between 0.6 and 0.7.
(C) rate(Non-Night owl) must be between 0.4 and 0.7.
(D) rate(Non-Night owl) must be between 0.6 and 0.7.
Answer is (B). From rate(Non-Night owl | Staff)= 0.4, we obtain
rate(Night owl | Staff) = 1 − 0.4 = 0.6.
Since rate(Night owl | Student) = 0.7, applying the basic rule of rates, the overall
rate(Night owl) must be between these two rates at the subgroups. Thus rate(Night
4
owl) must be between 0.6 and 0.7. In addition, from rate(Night owl | Student)= 0.7,
we obtain
rate(Non-Night owl | Student) = 1 − 0.7 = 0.3.
Applying the basic rule of rate, the overall rate(Non-Night owl) must be between 0.3
and 0.4.
8. The relative frequency table below shows the distribution of annual total personal
income for the entire population of 6,402,386 living in City X. It is known that there
are 59% males and 41% females in City X.
Income
$9,999 or less
$10,000 to $14,999
$15,000 to $24,999
$25,000 to $34,999
$35,000 to $49,999
$50,000 to $64,999
$65,000 to $74,999
$75,000 to $99,999
$100,000 or more
Percent
2.2%
4.7%
15.8%
18.3%
21.2%
13.9%
5.8%
8.4%
9.7%
We are told that in City X, 71.8% of females earn less than $50,000 per year. Which
of the following statements is correct?
(A) There is positive association between being male and earning less than $50,000.
(B) There is negative association between being male and earning less than $50,000.
(C) There is no association between being male and earning less than $50,000.
(D) We do not have sufficient information to determine the correctness of the other
three statements.
Answer is (B). We need to figure out the relationship between rate(Less than 50,000
| Males) and rate(Less than 50,000 | Females). The latter is given in the question
as 71.8%. From the table, one can calculate the overall rate(Less than 50,000) to be
2.2 + 4.7 + 15.8 + 18.3 + 21.2 = 62.2%. So among everyone, the rate of those earning
less than 50,000 is 62.2%, while for females it is higher at 71.8%. By basic rule on
rates, the rate among males should be less than 62.2% and hence is less than 71.8%.
Thus
rate(Less than 50,000 | Males) < rate(Less than 50,000 | Females)
which implies that the association between male and earning less than 50,000 is negative.
9. Consider the following statements.
(I) A spokesman was quoted as saying that the proportion of a certain ethnic group
among those who contracted Disease X was lower than the proportion of that
ethnic group in the general population.
(II) A reporter interpreted the statement and concluded that the members of this
ethnic group are less likely to contract Disease X than a random member of the
population.
Which of the following is correct?
(A) The two statements are equivalent.
(B) We can infer statement (I) from (II) but not the other way round.
5
(C) We can infer statement (II) from (I) but not the other way round.
(D) We can neither infer statement (I) from (II), nor infer statement (II) from (I).
Answer is (A). Let M represent the particular ethnic group (N M represent not the
ethnic group), C represent contracting Disease X (N C represent not contracting Disease X). Statement (I) says that
rate(M | C) < rate(M ).
By the basic rule on rates,
rate(M | C) < rate(M ) < rate(M | N C).
This implies
rate(C | M ) < rate(C) < rate(C | N M ).
In particular,
rate(C | M ) < rate(C),
which is statement (II).
10. A school has some students, who are categorised as follows:
In a sports club
Not in a sports club
Male
20
30
Female
20
80
Which statement below is correct?
(A) There is no association between being in a sports club and sex, since the number
of male students in a sports club is the same as the number of female students
in a sports club.
(B) There is an association between being in a sports club and sex, since the number
of male students in a sports club is the same as the number of female students
in a sports club.
(C) There is no association between being in a sports club and sex, since the rate of
sports club students among males is different from the rate of sports club students
among females.
(D) There is an association between being in a sports club and sex, since the rate of
sports club students among males is different from the rate of sports club students
among females.
Answer is (D). To see if there is an association, we compare the rates of sports club
students among male and female students.
20
= 0.4;
rate(sports club students | male) = 50
20
rate(sports club students | female) = 100
= 0.2.
Since the rates are different, there is an association between being in a sports club and
sex.
11. Coriander is a common herb used to add flavour to various kinds of dishes. Suppose
we have two countries: Country X and Country Y . We have individuals who either
dislike coriander, or like coriander, and are either male or female. We know that in
country X,
ˆ rate(Dislike) = 0.1.
ˆ rate(Dislike | Male) = 0.3.
6
Also, in country Y ,
ˆ rate(Female) = 0.8.
ˆ rate(Female | Dislike) = 0.4.
Which of the following statements must be true? Select all that apply.
(A) rate(Male) < rate(Female) in country X.
(B) There is a positive association between disliking coriander and females in country
X and country Y separately.
(C) Overall rate(Male) is between 0.2 and 0.5, when both countries are combined.
Only (A) is correct. The Basic Rule on Rates states that
rate(A | not B) ≤ rate(A) ≤ rate(A | B).
As such, in country X, we note that rate(Dislike | Female) has to be between 0 and
0.1, since rate(Dislike) = 0.1 and rate(Dislike | Male)= 0.3. In any case, rate(Dislike)
is closer to rate(Dislike | Female) than it is to rate(Dislike | Male). Hence, rate(Male)
< rate(Female) in country X is correct.
In country X,
rate(Dislike | Male) > rate(Dislike | Female).
Therefore, there is a negative association between disliking coriander and being females
in country X. In a similar way, using the Basic Rule on Rates, we note that in country
Y,
rate(Female | Dislike) < rate(Female) < rate(Female | Not Dislike).
Therefore, there is a negative association between disliking coriander and being females
and statement (B) is incorrect.
Finally, in country Y ,
rate(Male) = 1− rate(Female) = 0.2.
On the other hand, we note previously in country X that females should form the
majority as rate(Female) > rate(Male). Hence, rate(Male) can be smaller than 0.2.
Thus the overall rate(Male) can be between 0 and 0.2 and statement (C) is incorrect.
12. A researcher is interested to find out if the residential location of a person is associated
with whether the person is a night owl (only sleeps after midnight). The variables of
interest are:
Location (East/West of Singapore).
Sleep Pattern (Night Owl/ Not Night Owl).
The researcher surveyed 1000 people in Singapore and the following is a table showing
the results from the researcher’s study.
Night Owl
Not Night Owl
East
120
450
West
330
100
From the table’s results, there is an association between ‘Location’ and ‘Sleep Pattern’
because
(A) rate(Night Owl | East) < rate(Night Owl | West).
(B) rate(East | Night Owl) < rate(West | Night Owl).
7
(C) rate(Night Owl | West) < rate(Night Owl | East).
(D) rate(Night Owl | East) < rate(East | Night Owl).
Answer is (A). To check for association, we compare rate(A | B) with rate(A | NB).
120
and rate(Night Owl | West) = 330
Here rate(Night Owl | East) = 570
430 . Since rate(Night
Owl | East) < rate(Night Owl | West), there is an association present between ‘Location’ and ‘Sleep Pattern’.
13. Jolie is interested to find out if there is any association between heights of individuals
and reading habits. She collected a sample of 60 individuals, all with unique heights.
Individuals whose height is equal to or greater than the median height among the 60
individuals are considered as “Tall”,while individuals whose height is lower than the
median height were considered as “Not Tall”. Individuals who complete at least 2
books a week were classified as “Frequent Reader”, while individuals who read less
than 2 books a week were classified as “Not Frequent Reader”. She then constructed
a 2 by 2 contingency table as follows:
Tall
12
Frequent Reader
Not Frequent Reader
Column Total
Not Tall
Row Total
21
Unfortunately, after constructing the above table on a piece of paper, she spilled coffee
on that paper, and she was only able to recover the above figures. What conclusion
could possibly be deduced from the above?
(A) It is impossible to deduce if “Frequent Reader” is associated with being “Tall”
as more information is needed.
(B) “Frequent Reader” is positively associated with being “Tall”.
(C) “Frequent Reader” is not associated with being “Tall”.
(D) “Frequent Reader” is negatively associated with being “Tall”.
Answer is (B). Consider the fact that a sample of 60 individuals were selected. Since
all individuals have unique heights, the median height would be between the 30th
and 31st individual when the individuals are lined up in ascending order of height.
Hence, the number of “Tall” people in total would be 30, while the number of “Not
Tall” people would be 30. Consequently, the number of “Tall” people who are “Not
Frequent Reader” would be 18, and the number of “Not Tall” and “Frequent Reader”
would be 9. As such,
rate(Frequent Reader | Tall) = 12
30 = 40%
while
9
rate(Frequent Reader | Not Tall) = 30
= 30%.
Since rate(Frequent Reader | Tall) is greater than rate(Frequent Reader | Not Tall),
there is a positive association between being “Frequent Reader” and “Tall”.
14. Consider the following contingency table that shows the smoking status and outcome
after 20 years (heart disease or not) for 270 people.
Smoker
Not Smoker
Column Total
Heart Disease
60
90
150
No Heart Disease
30
90
120
Row Total
90
180
270
The values of rate(Heart Disease) and rate(Smoker and No Heart Disease) are
(1)
and
(2)
, respectively. Give your answers as percentages correct to one decimal
place.
8
5
Rate(Heart Disease) = 150
270 = 9 = 55.6%. Rate(Smoker and No Heart Disease) =
30
1
270 = 9 = 11.1%.
15. The table below shows the income level (Low, Middle, High) of 100 adults classified
by gender.
Male
Female
Column Total
Low
10
15
25
Middle
30
20
50
High
15
10
25
Row Total
55
45
100
The proportion of middle-income earners among males is
as a percentage correct to one decimal place.
. Give your answer
Proportion of middle-income earners amongst males refers to
rate(Middle | Male).
This means we take the number of middle-income males, divided by the total number
of males. Thus
rate(Middle | Male) =
30
30
=
= 54.5%.
10 + 30 + 15
55
16. The employees of a particular company were surveyed on the average number of hours
they spend sitting down each day (between 1 and 6 hours, or more than 6 hours)
and whether their Body Mass Index classifies them as being obese. The graph below
summarises the results of the survey.
Based only on the information from this graph, determine which of the following
statements must be correct. For ease of notation, we will identify those who spend 1
to 6 hours sitting down as Group A and those who spend more than 6 hours sitting
down as Group B.
(I) Out of all the responses, there are more obese employees in group B than in group
A.
(II) It is more likely that a surveyed employee is from group B if he/she is not obese
as compared to if he/she is obese.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
9
(D) Neither (I) nor (II).
Answer is (D). Statement (I) may not be correct. Rates do not equate absolute
numbers. A higher rate of obese people in group B as compared to that in group
A does not mean that there are more obese employees in group B than in group A.
Statement (II) is not correct. Since we have
rate(Obese | Group A) < rate(Obese | Group B),
by symmetry rule, we will have
rate(Group A | Obese) < rate(Group A | Not obese).
Hence statement (II) is not correct.
17. A professor wanted to know more about media consumption habits among the students
in his class. He took note of each student’s gender (“Male” or “Female”) and asked
the student which communication platform they used most often (either “Telegram”
or “Whatsapp”). It was found that rate(Male | Telegram) > rate(Male | Whatsapp).
Which of the following statements must be true?
(I) rate(Telegram | Male) > rate(Telegram | Female)
(II) rate(Female | Telegram) < rate(Female | Whatsapp)
(III) rate(Whatsapp | Female) < rate(Whatsapp | Male)
(A) Only (I).
(B) Only (I) and (II).
(C) All three statements are true.
(D) None of the statements is true.
Answer is (B). Given that we know from the question that Males and Telegram are
positively associated, by the symmetry rule, Statements (I) and (II) are true, while
Statement (III) is false.
18. The Singapore Department of Statistics publishes yearly figures for the proportion of
citizen singles in the population, across various subgroups like gender and age group.
The following is an abridged list of proportions of singlehood for the year 2022, across
their respective subgroups, obtained from the online database
(https://tablebuilder.singstat.gov.sg/table/TS/M810691):
Proportion of Singles (in %)
31.2
34.4
27.9
17.4
16.6
18.0
30-39 Years (Total)
30-39 Years (Males)
30-39 Years (Females)
40-49 Years (Total)
40-49 Years (Males)
40-49 Years (Females)
Let us consider the age group of 30-39 Years as people in their “thirties”, and the age
group of 40-49 Years as people in their “forties”. Based on the available information
above, which of the following statements is/are necessarily true? Select all that apply.
(A) Among those in their thirties, rate(Males) > rate(Females).
(B) Among those in their forties, rate(Males) < rate(Females).
10
(C) When both age groups are combined, rate(Singles | Males) > rate(Singles | Females).
(D) When both age groups are combined, rate(Singles | Males) < rate(Singles | Females).
(E) Gender is a confounder in the association between singlehood and age group.
(A) and (B) are correct. Among those in their thirties, rate(Singles)=31.2% is closer
to rate(Singles | Males)=34.4% than to rate(Singles | Females)=27.9%, therefore (A)
is correct. Among those in their forties, rate(Singles)=17.4% is closer to rate(Singles
| Females)=18.0% than rate(Singles | Males)=16.6%, therefore (B) is correct. When
both age groups are combined, using the Basic Rule on Rates,
16.6% < rate(Singles | Males) < 34.4%;
while
18.0% < rate(Singles | Females) < 27.9%.
As such, it is possible for rate(Singles | Males) = rate(Singles | Females), therefore (C)
and (D) are not necessarily true. By a similar reasoning, (E) is also not necessarily
true because gender is not necessarily associated with singlehood and therefore it is
not necessarily a confounder.
19. The following infographic was extracted from the Department of Statistics Singapore,
depicting the usual mode of transport amongst resident students travelling to school
in 2020.
Let us classify a student whose usual mode of transport to school is “Public Bus Only”,
“MRT/LRT and Public Bus Only”, “MRT/LRT Only” or “Other combinations of
MRT/LRT or Public Bus” as a student who takes public transport to school. If the
proportion of female resident students travelling to school who take public transport
and the proportion of male resident students travelling to school who take public
transport are the same, what is the proportion of students who take public transport
to school, out of the male resident students who travel to school? Give your answer
as a percentage correct to one decimal place.
Answer is 54.6%. From the infographic given, we see that the proportion of resident
students who take public transport to school is 22.0% + 7.0% + 23.0% + 2.6% =
54.6%. Furthermore, since the proportion of female resident students travelling to
school who take public transport and the proportion of male resident students travelling to school who take public transport are the same, by the third consequence of the
Basic Rule on Rates, the proportion of students who take public transport to school,
out of all the male resident students who travel to school is 54.6%.
20. The following line graph taken from the Ministry of Health website shows the median
wait times for admission at the Emergency Medicine Departments at five hospital lo11
cations (CGH, NTFGH, SGH, SKH, TTSH) in Singapore across seven days in October
2023.
We consider a hospital to have a long wait time on a particular day if the median wait
time for admission is greater than 10 hours. Fridays, Saturdays and Sundays are considered weekends, while the other days are considered weekdays. Based only on the line
graph, which of the following statements is/are necessarily true? Select all that apply.
(A) Weekdays are positively associated with long wait times at NTFGH.
(B) Weekends are negatively associated with long wait times at CGH.
(C) There is no association between weekdays and long wait times at SKH.
(D) Simpson’s Paradox is observed in the association between wait times and whether
a day is a weekday or a weekend.
(A) and (C) are true. We may form the following table from the line graph.
Long
Short
CGH
1
3
Weekday
NTFGH SGH SKH
3
1
0
1
3
4
TTSH
2
2
CGH
1
2
Weekend
NTFGH SGH SKH
0
0
0
3
3
3
At NTFGH, r(Long | Weekday) = 3/4 > 0/3 = r(Long | Weekend), so weekdays are
positively associated with long wait times. (A) is correct.
At CGH, r(Long | Weekday) = 1/4 < 1/3 = r(Long | Weekend), so weekdays are
negatively associated with long wait times. Hence (B) is incorrect.
At SKH, r(Long | Weekday) = 0/4 = 0/3 = r(Long | Weekend), so there is no
association between weekdays and long wait times. Thus (C) is correct.
At SGH, r(Long | Weekday) = 1/4 > 0/3 = r(Long | Weekend), so weekdays are
positively associated with long wait times.
At TTSH, r(Long | Weekday) = 2/4 > 0/3 = r(Long | Weekend), so weekdays are
positively associated with long wait times.
Combining all the subgroups, r(Long | Weekday) = 7/20 > 1/15 = r(Long | Weekend),
so the overall association between weekdays and long wait times is positive. Since
there is a positive association between weekdays and long wait times in three of the
five subgroups, Simpson’s Paradox is not observed, hence (D) is incorrect.
12
TTSH
0
3
21. The rate of lung cancer among females in Singapore is 40%, while the rate of lung
cancer among males in Singapore is also 40%. Researchers also discovered that the rate
of lung cancer among smokers in Singapore is 70%. Which of the following statements
is/are true?
(I) Sex is a confounder when discussing the relationship between smoking and lung
cancer.
(II) Lung cancer is positively associated with smoking.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (B). We know that rate(Lung cancer | Females) is 40% which is also equal to
rate(Lung cancer | Males). Hence, lung cancer and being a male/female are not associated, which means that sex cannot be a confounder when looking at the relationship
between smoking and lung cancer.
By the basic rule on rates, we know that rate(Lung cancer) = 40%. Given that
rate(Lung cancer | Smokers) = 70%, we know that rate(Lung cancer | Non-smokers)
must be less than 40%. Since rate(Lung cancer | Smokers) is greater than rate(Lung
cancer | Non-smokers), we know that lung cancer is positively associated with smoking.
22. A researcher is interested in finding out if gender affects whether a student is lefthanded. The information on gender (Male/Female), master hand (Left/Right) and
IQ (Above average/Below or At average) of all students was collected. The data was
used to plot the figure below.
The researcher makes the following two statements.
(I) IQ can be a confounder when investigating the relationship between gender and
master hand.
(II) When finding out if IQ affects whether a person is left-handed, it is possible that
Simpson’s Paradox is observed in the data due to the third variable, gender.
Which of the following correctly describes the two statements above?
13
(A) Statement (I) is true but statement (II) is false.
(B) Statement (I) is false but statement (II) is true.
(C) Both statements are true.
(D) Both statements are false.
Answer is (D). From the graph, we see that
rate(Male | Above average IQ) = rate(Male | Below or at average IQ) = 0.6.
Alternatively, from the graph, we see that the ratio of male to female in the “Above
average” category is the same as in the “Below or at average” category. Hence, IQ
is not associated with gender. Since IQ is not associated with gender, it cannot be
a confounder when investigating the relationship between gender and master hand.
Thus statement (I) is false.
Since gender is not associated with IQ, it cannot be a confounder when studying the
relationship between IQ and master-hand. Since it is not even a confounder in the
first place, Simpson’s Paradox cannot be observed. Thus, statement (II) is false.
23. A researcher conducted an observational study to investigate whether smoking is associated with heart disease. 1000 participants were recruited in the study and the
researcher obtained the following result:
Smoker
Non-smoker
Heart disease
146
324
No heart disease
317
213
He also observed that 80% of the smokers were alcoholics, while 85% of the people
with heart disease were alcoholics. Consider the following statements:
(I) There is positive association between smoking and heart disease.
(II) Being an alcoholic is a confounder.
Based only on the information given above, which of the two statements must be true?
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (C). To conclude if being an alcoholic is a confounder, we need information
on the percentage of non-smokers who are alcoholic and the percentage of people who
are alcoholics among those who do not have heart disease.
From the table,
rate(Smokers | Heart disease) = 146
470 = 0.31
and
317
rate(Smokers | No heart disease) = 317+213
= 0.60.
Thus, in this example, smoking is negatively associated with heart disease, since the
rate of smokers is lower in the heart disease group as compared to the group with no
heart disease.
24. Consider a study that intends to examine whether the colour red makes children act
impulsively. A group of 500 children were assigned into two groups by the expert
opinion of a child psychologist; group Red if the psychologist pointed to the child,
14
and group Green if the psychologist did not. Each child is then led into a room that
has a big button in the colour of their group and labelled “DO NOT PRESS ME!”.
It is then recorded whether the child presses the button within 10 minutes. All the
children were each then given a candy for participating.
The children were also asked if they like candies. The following table summarises the
data. For instance, 22 children from group Red that pressed the button do not like
candies.
Pressed button
Did not press button
Like candies
Red Green
3
135
177
60
Does not like candies
Red
Green
22
1
38
64
Is liking candy a confounder in this study?
(A) Yes.
(B) No.
(C) There is insufficient information given to determine whether liking candy is a
confounder in this study.
Answer is (B). We have to check whether liking candy is associated with both the
colour of the group, and pressing the button.
3+177
rate(Like candies | Red) = 3+177+22+38
= 0.75.
135+60
rate(Like candies | Green) = 135+60+1+64
= 0.75.
Since the two conditional rates are the same, liking candy is not associated to the
colour of the group, and is hence not a confounder in this study. Another way to get
to this conclusion is to show that
rate(Red | Like candies) = rate(Red | Does not like candies) = 0.48.
25. To investigate whether controlling mothers tend to have obese children, a study was
conducted and it was found that controlling behaviours in mothers is positively associated with obesity in their children:
250
rate(Obese child | Controlling mom)= 350
= 71%;
rate(Obese child | Non-controlling mom)= 138
350 = 39%.
It was suspected that the health status of the father plays a role in the phenomenon,
so the sample was sliced according to whether the fathers were obese or not.
Obese father
Non-obese father
Overall
Non-controlling mother
Number Obese child
87
78
263
60
350
138
Controlling mother
Number Obese child
270
234
80
16
350
250
Which of the following statements is/are true?
(I) Obesity in the father is a confounder; furthermore, this is an example of Simpson’s
Paradox being observed.
(II) Obesity in children is positively associated with controlling mothers when the
father is obese.
15
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (A). Statement (I) is correct. Obesity in the father is a confounder since
87
= 24.9%;
rate(Obese father | Non-controlling mom)= 350
rate(Obese father | Controlling mom) = 270
350 = 77.1%.
So obesity in father is positively associated with controlling mother. Furthermore,
rate(Obese child | Obese father)= 78+234
= 87.4%;
357
rate(Obese child | Non-obese father)= 16+60
343 = 22.2%.
So obesity in child is positively associated with obesity in father. In addition, Simpson’s Paradox is observed since for the obese father subgroup, obesity in child is
negatively associated with controlling behaviour in mothers (87% vs 90%), and the
same is observed in the non-obese father subgroup. The trend is reversed when the
two subgroups are combined. Statement (II) is incorrect as stated above, obesity in
children is negatively associated with controlling mothers when the father is obese.
26. In a certain year, it is known that the prevalence of diabetes among Singapore residents
is 10% and the prevalence of diabetes among old (age 60 and above) Singapore residents
is 30%. It was suggested that sex is a possible confounder in the observed association
between age and diabetes among Singapore residents. After further analysis, the
researchers concluded that sex is not a confounder, and there is an association between
sex and age. Which of the following statements is/are true? Select all that apply.
(A) rate(Diabetes | Male) = rate(Diabetes | Female).
(B) rate(Male | Diabetes) = rate(Female | Diabetes).
(C) rate(Diabetes | Female) = 10%.
(A) and (C) are correct. Since sex is not a confounder in the study of association
between diabetes and age and sex is associated with age, thus there is no association
between sex and diabetes. Thus it is correct to say that rate(Diabetes | Male) is equal
to rate(Diabetes | Female). Consequently,
rate(Diabetes) = rate(Diabetes | Male) = rate(Diabetes | Female) = 10%.
On the other hand, we do not know how many diabetic residents are male or female,
so we cannot determine if rate(Male | Diabetes) is equal to rate(Female | Diabetes).
27. The graph below, depicting cases of COVID infection (titled “Local cases in the last
14 days by Age group and Vaccination status”), was published in a press release by
Singapore’s Ministry of Health on 21 July 2021.
16
Let us designate the age range of those 61 years or older as “seniors” , and all others
as “non-seniors”. Let us also consider the status of either full or partial vaccination
as “vaccinated”, and the status of being completely unvaccinated as “unvaccinated”.
What can be concluded based on the information given? Select all statements that apply.
(A) The rate of infection among seniors is lower than the rate of infection among
non-seniors.
(B) For cases depicted by the graphic, being vaccinated is positively associated with
seniors.
(C) Age is a confounder in the association between infection and vaccination status.
(D) There were about twelve cases on average daily (correct to the nearest whole
number) for those senior and vaccinated.
(B) and (D) are correct. Representing the information from the graph into the following 2-by-2 contingency table (with the stated conditions to form our categories) will
make the calculations for the necessary rates much easier.
Seniors
Non-seniors
Column total
Vaccinated
167
479
646
Unvaccinated
22
187
209
Row total
189
666
855
We are not able to compare the rate of infection among seniors with the rate of
infection among non-seniors because we do not have figures for non-infected persons.
Thus, we cannot compare rate(Infected | Seniors) vs. rate(Infected | Non-seniors).
However, we can compare rate(Vaccinated | Seniors) vs.
seniors). Using the table:
rate(Vaccinated | Non-
rate(Vaccinated | Seniors) = 167/189 = 88.36%, which is larger than rate(Vaccinated
| Non-seniors) = 479/666 = 71.92%. So being vaccinated and being senior are indeed
positively associated with each other.
With reference to the title of the graph, we take the number of cases for those senior
and vaccinated (which is 167) and divide it by 14 days. Since 167/14 = 11.9, it is true
that there were about 12 cases on average daily (correct to the nearest whole number)
for those senior and vaccinated.
To establish that age is a confounder, we need to find that age is associated with
both infection and vaccination status. However, as mentioned above, we have insufficient information to show that age and infection are associated with each other. So
17
we cannot determine if age is a confounder in the association between infection and
vaccination status.
28. A researcher is investigating the relationship between alcohol consumption and the
risk of an individual contracting liver cancer. He conducts his investigation in a town
consisting of 10000 adults. Every adult is classified as an Alcoholic or Non-alcoholic
and similarly every adult’s risk of liver cancer is classified as High or Low. During his
investigation he realises that
rate(Male | Alcoholic) ̸= rate(Male)
and
rate(Female | High) < rate(Female).
Based only on the information above, which of the following statements must be true?
(A) Being alcoholic is positively associated with being male.
(B) High risk of liver cancer is positively associated with being female.
(C) Sex is a confounder in this study.
(D) There is an association between the risk of liver cancer and alcohol consumption.
Answer is (C). Based on the information given in the question, we cannot determine
whether
rate(Male | Alcoholic) > rate(Male | Non-alcoholic),
so we cannot conclude that being alcoholic is positively associated with being male.
Option (B) is false since by the basic rule of rates, we will have
rate(Female | High) < rate(Female) < rate(Female | Low),
meaning that the association between high risk of liver cancer and being female is
negative. Option (C) is true because by the basic rule of rates we must have either
rate(Male | Alcoholic) > rate(Male) > rate(Male | Non-alcoholic),
or
rate(Male | Alcoholic) < rate(Male) < rate(Male | Non-alcoholic).
So in either case we obtain that
rate(Male | Alcoholic) ̸= rate(Male | Non-alcoholic).
Therefore sex is associated with alcohol consumption. By the fact that there is negative
association between high risk of liver cancer and being female, we see that sex is
associated with the risk of liver cancer. Therefore sex is a confounder. We do not have
any information to determine the association between alcohol consumption and liver
cancer so we cannot tell whether there is an association between alcohol consumption
and risk of liver cancer.
29. In a study investigating the association between having pets and depression, it is
found that among males, rate(Depressed) = 0.4 and rate(Depressed | Having pets)
= 0.5, whereas among females, rate(Depressed) = 0.3 and rate(Depressed | Having
pets) = 0.6. Consider the following two statements:
(I) It is possible to observe Simpson’s Paradox in this study.
(II) Among females, rate(Having pets | Depressed) > rate(Having pets | Not depressed).
Which of the above two statements is/are true?
(A) Only (I).
18
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (B). Being depressed is positively associated with having pets in each
gender subgroup due to basic rule on rates, since rate(Depressed | Having pets) >
rate(Depressed) implies that rate(Depressed) > rate(Depressed | Not having pets).
Statement (I) is false because rate(Depressed) is always between 0.3 and 0.4 and
rate(Depressed | Having pets) is always between 0.5 and 0.6 by the basic rules of
rates. Thus, being depressed is positively associated with having pets in the overall
group and no Simpsons paradox is observed. As having pets is positively associated
with being depressed among females, statement (II) is true by symmetry rules of rates.
30. A Taylor Swift fan website forum conducted a post-concert survey for its members
who attended The Eras Tour in 2023. The information collected noted the seating
types of respondents (“Seating” or “Standing”), self-evaluations of their concert experience (“Good” or “Bad”), as well as their gender (“Male” or “Female”). The data
is represented in the following table.
Good
Bad
Seating
Male Female
30
39
6
5
Standing
Male Female
20
33
1
14
Which of the following statements is/are true based only on the data above? Select all that apply.
(A) A “Good” concert experience is positively associated with “Standing” seats.
(B) A “Good” concert experience is positively associated with “Seating” seats.
(C) A “Good” concert experience is negatively associated with females.
(D) Gender is a confounder in the association between concert experience and seating
type.
(E) Simpson’s Paradox is observed in the association between concert experience and
seating type.
(B), (C) and (D) are correct. Checking for the association between concert experience
and seating type:
rate(Good | Seating) = 69/80 > rate(Good | Standing) = 53/68,
therefore (A) is not true and B is true. Checking for the association between concert
experience and gender:
rate(Good | Female) = 72/91 < rate(Good | Male) = 50/57,
therefore (C) is true. Checking for the association between seating type and gender:
rate(Seating | Female) = 44/91 < rate(Seating | Male) = 36/57,
therefore (D) is not true. Among males, checking the direction of association between
concert experience and seating type:
rate(Good | Seating) = 30/36 < rate(Good | Standing) = 20/21;
while among females, checking the direction of association between concert experience
and seating type:
rate(Good | Seating) = 39/44 > rate(Good | Standing) = 33/47,
therefore (E) is not true.
19
1. The results of 10 students in a test are given below, where L indicates that a student
obtained a score below 55, and the two H’s (not necessarily the same) indicate scores
above 90. The maximum possible score is 100.
60
80
L
70
H
83
59
H
70
65.
Which of the following statements must be true about the scores given above?
Select all that apply.
(A) The median score is 70.
(B) There is only one mode among them.
(C) The range is greater than 35.
(D) The mean score is greater than 72.2.
(A) and (C) are true. After reordering the 10 scores, we have:
< 55
59
60
65
70
70
80
83
> 90
> 90.
The median is the average of the 5th and 6th score, i.e. 70. Here, 70 is one mode.
However, we may have two modes, if both scores above 90 are the same. Hence, the
statement on mode may not be true.
The range is greater than 90 − 55 = 35. Hence, the statement on range is true. Here,
72.2 =
55 + 59 + 60 + 65 + 70 + 70 + 80 + 83 + 90 + 90
,
10
It is possible that the scores corresponding to L, H and H are for example, 50, 91, 91
respectively. Then the mean score will be 71.9. Hence the statement on mean may
not be true.
2. Which of the statements below will always hold if an outlier for a single numerical
variable is removed?
(I) The mean will decrease.
(II) The median will decrease.
(III) The mode will decrease.
(A) Only (I).
(B) Only (I) and (II).
(C) Only (II) and (III).
(D) None of the statements.
Answer is (D). We can give a suitable example to show that all the statements do
not hold. In the set of numbers (−1000, 0, 0, 1), the outlier is −1000. The mean
values with and without the outlier are −249.75 and 0.33 respectively. Hence the
mean actually increases without the outlier. It is clear that the median and mode of
0 remain unchanged after the removal of the outlier.
3. The graph below compares two sets of grades of a cohort of 233 students who took a
course.
1
The histograms of the same two sets of grades are provided below.
Based on the figures, which of the following statements is/are correct?
Select all that apply.
(A) The boxplot on the left corresponds to the midterm grades.
(B) The median of the final grades is not higher than that of the midterm grades.
(C) The bin width of the histogram for final grades is larger.
(D) There are more outliers in the set of final grades than that of the midterm grades.
(B) and (D) are true. Since the minimum value in the boxplot on the right is below
40, the boxplot on the right must correspond to the midterm grades. Thus the boxplot
on the left is for the final grades. We can see from the boxplots that the median of the
final grades is slightly lower than that of the midterm grades. The bin widths of both
histograms are 10. It is clear that there are outliers in the boxplot for final grades but
not in the boxplot for midterm grades.
4. The Registry of Marriages is interested to see the relationship between the ages of
husbands and wives in City X. They randomly sampled 1000 pairs of husbands and
wives from the population of City X and obtained data of their ages (in years). Looking
through the data, they found that men always marry women who are younger than
them. Based only on the information given above, which of the following statements
must be true?
(I) The average age of the husbands is more than the average age of the wives.
(II) The standard deviation of husband’s age is more than the standard deviation of
wife’s age.
(A) Only (I).
2
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (A). Since all the husbands are older than their respective wives, the average
age of the husbands would be greater than the average age of the wives, and thus
statement (I) is correct. On the other hand, we do not have information about the
spread of the husband’s age relative to the spread of the wife’s age, so we cannot be
sure if statement (II) is correct.
5. The following boxplot shows the final examination marks of students from class A.
The passing mark for the final examination is 12 out of 20. Suppose that the boxplot of the final examination marks for class B is the same as the boxplot for class
A. What can be said about the final examination marks of students from class B?
Select all the correct statements.
(A) The proportion of students in class B who passed the examination must be the
same as that for class A.
(B) At least 50% of the students from class B failed the final examination.
(C) The standard deviation of the students’ marks in class B is equal to the standard
deviation of those in class A.
(D) Based on the boxplot, there are no outliers in the grades of students from class
B.
(E) The average of the students’ marks for class B must be equal to that for class A.
(B) and (D) are correct. The boxplot above does not tell us anything about the
percentage of students who obtained at least 12 out of 20 in the final examination.
3
The median of the final marks is 10, which implies that at least 50% of students failed
the examination.
The boxplot does not tell us anything about the average and standard deviation of
the final examination marks.
It is clear from the boxplot that there are no outliers.
6. The five-number summary of a numerical variable with 47 values is:
Min
12.0
Q1
15.0
Median
16.5
Q3
18.0
Max
24.0
Which of the following statements must be true? Select all that apply.
(A) There are no outliers in the data.
(B) There is at least one low outlier in the data.
(C) There is at least one high outlier in the data.
(D) There are both low and high outliers in the data.
Only (C) is true. The IQR is 18 − 15 = 3. Since 24 > 18 + 1.5 × 3 = 22.5, 24 is a high
outlier and thus there is at least one high outlier. As 15 − 1.5 × 3 = 10.5 and there
are no values smaller than 12, there are no low outliers.
7. Suppose that the following are 10 data points for a numerical variable X:
3, 50, r, 8, 20, 1, 32, 58, 10, 138,
where r is an unknown whole number and r ̸= 138. Based on the definition of an
outlier for a boxplot, if 138 is the only outlier in this data set, the maximum possible
.
value of r is
Answer is 133. Firstly, in order to have 138 as the only outlier, we must have r < 138.
Then since we want to find the maximum possible value of r, when we order the other
9 points from the smallest to the largest value, we can try to let r be a number between
the second largest number and 138. That is,
1, 3, 8, 10, 20, 32, 50, 58, r, 138.
We can see that Q1 = 8, Q3 = 58 and IQR= 50. Thus, any number greater than
Q3 + 1.5 × IQR = 58 + 1.5 × 50 = 133 will be an outlier. This means that the
maximum possible value of r is 133.
8. Alice wants to compare the distribution of car transaction prices in 2020 with that in
2021. In her comparison, she also wants to highlight that the number of transactions
in 2021 is much larger than that in 2020. Which of the following best serves her
purpose?
(A) Plotting two boxplots (one for each year) of car transaction prices along the same
Y -axis.
(B) Plotting two histograms (one for each year) of car transaction prices along the
same Y -axis.
(C) Plotting a bar chart of the number of transactions against year.
(D) Plotting a scatter diagram of car transaction prices in 2021 against car transaction
prices in 2020.
4
Answer is (B). Only (A) and (B) incorporate tools to compare distributions of numerical data. Since a histogram displays information about the size of the data set and
a boxplot does not, only (B) can show the difference in the number of transactions
between the two years.
9. The CEO of a company wishes to find out the level of job satisfaction that his employees have. His company has 876 employees. He administers an anonymised survey
in which there are questions that ask about the employees’ satisfaction with regards
to various aspects of their jobs, including welfare, remuneration, career progression,
learning etc.
For each question, employees are to rate their satisfaction on a scale of 1-9. In this
scale, 1 represents the lowest level of satisfaction whilst 9 represents the highest level
of satisfaction. Broadly, any score below 5 for a question is regarded as “not satisfied”
and any score above 5 for a question is regarded as “satisfied” while a score of 5
represents being “neutral”.
Every employee fills in the survey based on how he/she feels about the job and the
data is collected. Assume that every employee is honest in the response and is fully
aware of what the values on the scale represent.
What type(s) of analysis is/are considered appropriate for the data on the satisfaction
scores? Select all that apply.
(A) For each question, one can perform a summary statistics calculation on the satisfaction scores to obtain the mean, standard deviation, median, as well as the
interquartile range, thereby giving quantitative information on the employees’
satisfaction as a whole.
(B) For each question, one can plot a histogram to depict the distribution of the
satisfaction scores and perform a calculation to determine the exact degree of
skewness of the histogram.
(C) For each question, one can calculate the proportion of employees who gave each
satisfaction score and determine what percentage of employees are “not satisfied”
and what percentage are “satisfied”.
Only (C) is correct. As satisfaction scores can be subjective and differences in satisfaction levels need not be the same, this variable cannot be treated as a numerical
variable and is an ordinal variable. Hence, while we can compute summary statistics and perform arithmetic, after doing so, we are unable to establish any numerical
meaning from these statistics.
A histogram is used to plot the distribution of a numerical variable, so once again it
is not appropriate to use a histogram to plot this data and we should not perform any
calculation to determine its skewness.
For each question, it is appropriate to calculate the proportion of responses for each of
the satisfaction scores and then determine the overall proportion of employees who are
“satisfied” and “not satisfied” which is generally how we should deal with categorical
variables.
10. The following histogram and boxplot are constructed using 90 observations of the numerical variable x. Based on the graphics, determine which of the following statements
must be true.
5
(A) The minimum value is 2.73.
(B) The distribution is right-skewed.
(C) Only a quarter of the observations is higher than 4.32.
(D) None of the other statements must be true.
Answer is (D). The minimum value corresponds to the outlier shown in the boxplot,
which is lower than 2.73. The distribution is left-skewed. 4.32 is the median, hence
(C) is also not true.
11. Suppose that the following are 10 data points for a numerical variable X:
−150, 50, r, 8, 20, 1, 32, 70, 10, 5,
where r is an unknown number and r ̸= −150. Based on the definition of an outlier
for a boxplot, if −150 is the only outlier in this data set, the minimum possible value
of r is
(A) −149.9.
(B) −45.5.
(C) −30.
(D) −14.5.
Answer is (B). Firstly, in order to have -150 as the only outlier, we must have r > −150.
Then since we want to find the minimum possible value of r, when we order the other 9
points from the smallest to the largest value, we can try to let r be a number between
-150 and the second smallest number. That is, −150, r, 1, 5, 8, 10, 20, 32, 50, 70. We
can see that Q1 = 1, Q3 = 32 and IQR = 31. Thus, any number smaller than
Q1 − 1.5 × IQR = 1 – 1.5 × 31 = −45.5 will be an outlier. This means that the
minimum possible value of r is -45.5.
12. The graph below shows a scatter plot between exam scores for 100 students and their
corresponding number of days absent from school throughout a year. The maximum
mark for the exam is 100. There is an outlier present in the scatter plot.
6
From the scatter plot above, which of the following statements is/are true? Select all that apply.
(A) After removing the outlier, the strength of the linear association between the two
variables will increase.
(B) After removing the outlier, the correlation coefficient between the two variables
will increase.
(C) Higher absenteeism rate causes lower exam scores.
(D) If a linear regression line is fitted to this data, the gradient of the line will be
negative.
(A) and (D) are true. The correlation between the two variables is negative. After
removing the outlier, the strength of the linear association between the two variables
will increase, hence (A) is true. The correlation coefficient after removing the outlier
will decrease as it will go closer to -1, hence (B) is false. (C) is false as this statement
is causation, not correlation. (D) is true as the sign of the correlation coefficient can
tell us the sign of the gradient of the regression line. Since the correlation coefficient
is negative, as seen in the scatter plot, the gradient of the regression line will also be
negative.
13. Mary wanted to find out if the duration spent on playing video games is associated
with the amount of sleep one gets. She collected data from 5000 participants who
played video games for 6 to 10 hours each day. She found that the participants’
number of sleep hours varied from 6 to 8 hours each day, and calculated the formula
of the linear regression line to be
y = 11 − 0.5x,
where x represents the number of hours spent playing video games each day and y
represents the number of sleep hours each day. From the data, what is a reliable
prediction of the average number of sleep hours for a person who plays an hour of
video games each day?
(A) 10.
(B) 9.
(C) 10.5.
7
(D) 20.
(E) There is insufficient data to make a reliable prediction.
Answer is (E). The linear regression line formula would not provide us a reliable
prediction of the average number of sleep hours for hours spent playing video games
beyond the observed range of 6 to 10 hours. The best fit line may change beyond the
observed range or there could be a non-linear relationship between the two variables.
14. The following scatter plot has an outlier, which has both positive X and Y values. X
ranges from 0 to 100 and Y ranges from 0 to 30. What will happen to the correlation
coefficient between X and Y if we remove the outlier?
(A) It will remain the same.
(B) It will decrease.
(C) It will increase.
Answer is (B). Initially, the correlation coefficient is positive. When we remove the
outlier, the data points form an “ellipse” and thus the new correlation coefficient
between X and Y is close to 0, meaning that the correlation coefficient decreased.
15. There are at least 3 known scales that are used to measure temperature: namely, the
Kelvin scale (denoted by K), the Fahrenheit scale (denoted by o F), and the Celsius
scale (denoted by o C). The relationships between the 3 scales are as follows:
Temperature in Fahrenheit = 59 × Temperature in Celsius +32.
Temperature in Kelvin = Temperature in Celsius +273.15.
Within a data set, it is known that the correlation coefficient between numerical
variables x and T is 0.72 where T is the temperature measured in Degrees Fahrenheit.
The equation of the regression line is T = 0.32x + 87 for x ranging from 0 to 5
inclusive. Based on this information, which of the following statements is/are true?
Select all that apply.
(A) If we change the scale of temperature to Kelvin, the correlation coefficient and
gradient of the regression line will remain unchanged.
(B) If we change the scale of temperature to Degree Celsius, the correlation coefficient
and gradient of the regression line will remain unchanged.
8
(C) If we change the scale of temperature to Kelvin, only the correlation coefficient
remains the same but the gradient of the regression line changes.
(D) If we change the scale of temperature to Degree Celsius, only the correlation
coefficient remains the same but the gradient of the regression line changes.
(E) If we change the scale of the temperature to Kelvin, both the correlation coefficient and the gradient of the regression line will change.
(C) and (D) are true. The correlation coefficient is unchanged by adding and subtracting constants as well as multiplying and dividing by positive constants. Using
the relationship
sy
m=r
sx
when changing from Fahrenheit to Celsius, it involves multiplying by 59 which will cause
the standard deviation of temperature to change. But since r remains unchanged, the
gradient therefore has to change. But a change from Celsius to Kelvin does not cause
any further change to the gradient since we are only adding a constant. Therefore, a
change from Fahrenheit to Kelvin will result in a change to the gradient.
16. Which of the following statements is/are true?
(I) A correlation coefficient of 0 means that there is no linear association between
the two variables.
(II) A correlation coefficient of −0.8 indicates a weaker linear association than a
correlation coefficient of 0.7.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (A). Statement (I) is true. The correlation coefficient is a measure of linear
association between two variables, but it is unable to measure non-linear association.
Statement (II) is false as the correlation coefficient of −0.8 shows a stronger linear
association than 0.7. This is because the magnitude of the first correlation coefficient,
0.8, is greater than the second one, 0.7. The negative sign only indicates the direction
of the correlation, but not the strength.
17. Xiao Ming finds that the heavier a person is, the higher his pulse rate tends to be.
A linear regression model he built suggests that 20-kilogram differences in weight
correspond to differences in pulse rates of 5 beats per minute. Which of the following
statements is/are always correct?
(I) The correlation coefficient between body weight and pulse rate is 41 .
(II) Your pulse rate slows down 3 beats per minute if your weight decreases by 12
kilograms.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (C). Statement (I) is false as the gradient of the regression line ( 14 in this case)
need not be the same as the correlation coefficient between the variables. Statement
9
(II) is false as the regression line gradient gives only the “predicted average value of
y” given a value of x.
18. A researcher predicts the total number of bacteria in an experiment (denoted by y)
using simple linear regression on ln y vs x. The regression equation is given by
ln y = 0.5x + 2.5,
where the logarithm here is taken to base e, and x is the number of hours since 12PM.
Which of the following statements is/are always correct?
(I) According to the researcher’s model, there will be an average of 33 (rounded to
the nearest integer) bacteria in 2 hours.
(II) The average number of bacteria is predicted to increase by the same amount
every hour.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (A). When x = 2, ln y = 3.5. Thus, we obtain y = e3.5 = 33. Statement
(II) is not correct. We can verify this explicitly. For instance, from x = 0 to x = 1,
the predicted average number of bacteria increases by 8. From x = 1 to x = 2, the
predicted average number increases by 13.
19. A group of students wanted to investigate the relationship between bubble tea and
ice cream consumption. They conducted a survey to find out the number of cups of
bubble tea and scoops of ice cream consumed in a year. From the data collected, they
drew a scatter plot and fitted a linear regression line to the data with the equation of
y = 0.73x.
Based on the information given in the scatter plot, which statement(s) must be true?
(I) The correlation between bubble tea and ice cream consumption is 0.73.
(II) People who consume 100 cups of bubble tea are predicted to consume 73 scoops
of ice cream on average.
10
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (D). The regression line equation tells us about the slope and intercept
of the best-fit line. In this case, the slope is 0.73 and the intercept is 0. The slope
may not be equal to the correlation coefficient in general so statement (I) may not
be true. There is no information about people who consume more than 70 cups of
bubble tea from the current data. Hence we cannot use the regression line to predict
the average number of scoops of ice cream consumed for someone who consumed 100
cups of bubble tea. Hence statement (II) is not true.
20. There are two primary six classes in a tuition center. Class A and Class B each has
100 students and all students sat for a mathematics midterm test as well as a final
examination. In Class A, every student scores 1 point higher in the final examination
than in the midterm. In Class B, every student scores 1 point lower in the final
examination than in the midterm. For the midterm test, the average score is 50 and
standard deviation is 20 for both classes A and B. Which of the following statements
is/are correct?
(I) The correlation coefficient between the final examination score and the midterm
score in Class A is 1.
(II) The correlation coefficient between the final examination score and the midterm
score in Class B is −1.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (A). Note that the relationship between the midterm score and final examination score is deterministic. Furthermore, since standard deviation is 20, not all
students scored the same midterm/final examination mark. For Class A, imagine the
points (49,50), (50,51), (52,53) and so on, each (x, y) representing the midterm score
x and final exam score y. Clearly, these points lie on the straight line y = x + 1 with
positive gradient. So the correlation coefficient between final examination score and
the midterm score in Class A is 1. Similarly, for Class B, imagine the points (49,48),
(50,49), (52, 51) and so on. These points also lie on a straight line y = x − 1 with
positive gradient. The correlation coefficient between final examination score and the
midterm score in Class B is also 1.
21. A student produces a scatter plot whereby the y values range from 1 to 3, and he
observes a linear association, with regression equation y = 1.5x + 5. Which of the
following statements is/are correct?
(I) The correlation coefficient must be positive.
(II) The average y value of the data set collected is 2.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
11
(D) Neither (I) nor (II).
Answer is (A). Since the regression line has a positive slope, the correlation coefficient
must be positive and thus statement (I) is correct. While we know that the y values
ranges from 1 to 3, it does not mean that the average value of y is 2, so statement (II)
is not correct.
22. A set of bivariate data has the following properties:
n = 20, x = 4,
y = 5,
sx = 1,
sy = 2,
r = 0.5,
0 ≤ x ≤ 10.
For x = 5, the predicted average value of y is
.
Answer is 6. Let the regression equation be y = mx + b. Note that
sy
m = r = 1,
sx
and thus b = y − mx = 1. The regression line is thus y = x + 1. When x = 5, the
predicted average value of y is 6.
23. Dash is interested to find out if there is a correlation between the time spent in playing
computer games (in hours) and student’s score (out of 100). After slicing the data
according to gender, it was found that the regression line for males is
score = −1.5 × hours spent gaming + 85
and the regression line for females is
score = −1.3 × hours spent gaming + 87.5.
Furthermore, Simpson’s Paradox is observed when the 2 subgroups are combined.
Which of the following is/are possible ranges of r, the correlation coefficient between
the time spent gaming and the students’ score when the subgroups are combined?
(A) 0 ≤ r ≤ 1.
(B) −1 ≤ r < 0.
(C) There is not enough information to guess the range of r.
Answer is (A). Since the coefficient of hours spent gaming is negative for both subgroups, this must also mean that the correlation coefficient between hours spent gaming and score must be negative. Additionally, we also know that Simpson’s Paradox is
observed when the 2 subgroups are combined. Hence, the association between hours
spent gaming and score must be nonnegative, which means that 0 ≤ r ≤ 1.
24. There are 50 students in a class and the average (or mean) number of pens per student
is 4. Not all students in this class have the same number of pens. For a student with
4 pens, what is the standard unit for the number of pens for this student?
(A) −1.
(B) 0.
(C) 1.
(D) The standard deviation is required to answer this question.
Answer is (B). Note that standard deviation is non-zero here, since not all students
in this class have the same number of pens; there is a spread around the mean 4. For
a student with x = 4 pens, since x = 4, the standard unit for the number of pens for
this student is
x−x
= 0.
sx
12
25. Which of the following gives the most likely correct values of the correlation coefficients
obtained from the four scatter plots? Here, r1 refers to the correlation coefficient of
plot (1), r2 refers to the correlation coefficient of plot (2) and so on.
(A) r1 = 0.8, r2 = 0.92, r3 = 0.45, r4 = −0.7.
(B) r1 = 0.06, r2 = 0.92, r3 = 0.9, r4 = −0.7.
(C) r1 = 0.06, r2 = 0.92, r3 = 0.45, r4 = −0.7.
(D) r1 = 0.06, r2 = 0.92, r3 = 0.45, r4 = 0.05.
Answer is (C). For plot (1), there is a strong non-linear association, but the linear
correlation is not strong. For plot (2), there is a relatively strong linear correlation.
Plot (3) shows a weak linear correlation and plot (4) shows a negative correlation.
Considering all the above facts, (C) is the best option.
26. A researcher wishes to examine the association between the number of hours students
engage in gaming and their academic performance. He conducts his study in 15 schools,
and calculates for each school the average number of hours students spend gaming and
the average academic score of students. He notices that there is a correlation of −0.9
between the two sets of averages. He concludes that
“The correlation between gaming hours and academic score for all students from the
15 schools is −0.9”.
What fallacy is the researcher committing here?
(A) Atomistic fallacy.
(B) Ecological fallacy.
Answer is (B). The researcher is trying to obtain the correlation at the individual level
solely from the correlation at the aggregate level. This is an example of ecological
fallacy.
27. In a study on the impact of exercise on weight loss, researchers gather data on the
exercise habits and weight changes of all students in a particular school over a sixmonth period. After analysing the data, they find that for individual participants,
there is no correlation between the amount of exercise they engage in and the amount
13
of weight they lose. Some participants lose weight without exercising much, while
others exercise regularly but don’t experience weight loss.
Based on these individual-level findings, the researchers conclude that if they were to
compute the average exercise duration and average weight loss for each class in the
school, there will also be no correlation between these aggregated averages.
Which one of the following mistakes have been committed?
(A) Atomistic Fallacy.
(B) Ecological Fallacy.
(C) Confusing correlation with causation.
(D) None of the other options is correct.
Answer is (A). This conclusion is an Atomistic Fallacy because it incorrectly generalizes
the lack of a correlation at the individual level to the aggregated level. The study fails
to account for potential variations among individuals and the possibility that exercise
may have different effects on weight loss for different people. It is possible that while
at the individual level we may not observe any correlation between exercise and weight
loss, at the aggregated level, exercise could still be correlated with weight management.
28. Which of the following is an example of Simpson’s Paradox?
(A) The gradient of the regression line for y given x for all individuals in a data set is
−0.2. When partitioned into 3 different subgroups, the gradient of the regression
line for the same two variables are 2, 3.4 and −5.
(B) The overall correlation between x and y in a data set is 0.01. When partitioned
into 3 different subgroups, the correlation between the same two variables are
0.21, 0.13 and −0.7.
(C) The intercept of the regression line for y given x for all individuals is 1. When
partitioned into 3 different subgroups, the intercept of the regression line for the
same two variables are −2, −0.2, −1 and −0.3.
(D) The overall correlation between x and y in a data set is 0.2. When partitioned
into 4 different subgroups, the correlation between the same two variables are
−0.43, −0.36, 0.51 and 0.91.
Answer is (A). Intercept does not show us the direction of the association and all 4
could be the same direction as the overall. Correlation and gradient both show us
direction but only option (B) has more than half of the subgroups having a different
direction of association than the overall.
29. Suppose that the relationship between two variables can be best modelled by an exponential function of the form y = axb , where a and b are positive constants and x > 0.
Which set of variables should we plot if we wish to observe a linear relationship?
(A) y versus ln(x).
(B) ln(y) versus x.
(C) ln(y) versus ln(x).
Answer is (C). y = axb can be transformed into ln(y) = ln(a) + b ln(x), which is in the
form of a linear equation Y = mX + c, with Y = ln(y) and X = ln(x).
30. Suppose you are a student researcher interested in studying the relationship between
the average number of hours students spend studying in a day (independent variable)
14
and their final exam scores (dependent variable). You collect data from a sample of 50
students, recording the average number of hours each student studied a day and their
corresponding final exam scores. From the data collected you gathered the following
information:
3 hours ≤ Number of Hours Spent Studying ≤ 10 hours
2 marks ≤ Final Exam Scores ≤ 35 marks
Equation of the regression line: y = 4x − 10
What will be the predicted marks of a student if he studies for 11 hours on average in
a day?
(A) 4 marks
(B) 9 marks
(C) 34 marks
(D) We are not able to give a good prediction with the given information.
Answer is (D). We are not able to give a good prediction as the x-variable has a
maximum value of 10 hours, while the student studied for 11 hours on average.
15
1. A player rolls a fair six-sided die twice. You can assume the rolls are independent. We
define the following events:
A: The first roll shows numbers 1 or 2.
B: The second roll shows numbers 5 or 6.
C: The sum of the two rolls is less than or equal to 7.
Consider the following statements:
(I) P (B | C) > P (B).
(II) P (A and C) = P (A) × P (C).
Which of the statements above must be true?
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (C). Statement (I) is false since
P (B | C) =
3
1
P (B and C)
= 36
= ,
21
P (C)
7
36
1
while P (B) = 13 . Statement (II) is false because P (A and C) = 11
36 while P (A) = 3
21
7
and P (C) = 36 = 12 .
2. The sample space S of a probability experiment comprises the 30 days of September
2021. Let A denote the first 10 days of September 2021. Let B denote 1st September
2021. Which of the following statements is/are correct? Select all that apply.
(A) A is an event of S and B is an outcome of S.
(B) A is an outcome of S and B is an event of S.
(C) Both A and B are events of S.
(D) Both A and B are outcomes of S.
(A) and (C) are correct. A is a sub-collection of S so it is an event of S. B is an
element of S so it is an outcome of S. Since we also regard outcomes as events, both
A and B are events of S.
3. A study was conducted on 1000 students in a secondary school on 11th January 2021.
On that day, of these 1000 students, 80% of the students took the MRT (mass rapid
transit) to school. Among those students that took the MRT to school, 10% were late
for school. Among those students who did not take the MRT to school, 20% were late
for school.
Among those students who participated in the study and were late for school, a student
was randomly chosen, with every student having the same chance of being selected.
What is the probability (correct to 1 decimal place) that the chosen student did not
take the MRT?
(A) 20.0
(B) 10.0
(C) 33.3
1
(D) 30.0
Answer is (C). The total number of students is 1000 and since P (Take MRT) = 0.8
and P (Did not take MRT) = 0.2, hence 800 students took MRT and 200 students did
not.
Among 800 students who took MRT to school, 10% were late. That is,
P (Late | Take MRT) = 0.1,
so there were 80 students who took the MRT and were late.
Among 200 students who did not take MRT to school, 20% were late. So
P (Late | Did not take MRT) = 0.2,
so there were 40 students who did not take the MRT and were late.
Put the calculated figures in a 2 × 2 table as shown below.
Late
80
40
120
Took MRT
Did not take MRT
Column total
Not late
720
160
880
Row total
800
200
1000
From the table, we can calculate
P (Did not take MRT | Late) =
40
= 33.3%
120
(1 decimal place).
4. A football match is played between Singapore and Italy, at the Singapore National
Stadium. 90% of the spectators in the stadium are wearing red. Among the spectators
in the stadium wearing red, 80% are rooting for Singapore. Among the spectators in
the stadium not wearing red, 80% are rooting for Singapore.
Among all the spectators in the stadium, a spectator is randomly chosen with every
spectator having the same chance of being chosen.
Let A be the event that the chosen spectator is rooting for Singapore, and B be the
event that the chosen spectator is wearing red. Which of the following statements,
regarding the 2 events A and B is true?
(A) The 2 events are mutually exclusive and independent.
(B) The 2 events are mutually exclusive but not independent.
(C) The 2 events are independent but not mutually exclusive.
(D) The 2 events are neither independent nor mutually exclusive.
Answer is (C). By the basic rule on rates, observe that
P (Root for Singapore | Wear red)
=
P (Root for Singapore | Does not wear red)
=
P (Root for Singapore)
=
0.8
Since
P (Root for Singapore) = P (Root for Singapore | Wear red) = 0.8,
rooting for Singapore and wearing red are independent events. Here
P (Root for Singapore) + P (Wear red) = 1.7 ̸= P (Root for Singapore or Wear red),
since P (Root for Singapore or Wear red) ≤ 1. Hence rooting for Singapore and wearing red are not mutually exclusive events.
2
5. Systematic sampling with interval length k = 20 is to be utilised for an exit poll at an
event. Suppose a total of 410 people exit the event. What is the probability that the
57th person to exit the event is selected for the poll?
1
.
(A) 317
1
(B) 20
.
1
(C) 410
(D) Cannot be determined with the information provided.
Answer is (B). The 57th person will be selected if and only if number 17 is chosen as
the starting point, out of the integers 1 to 20. But since the selection of the starting
point is random (meaning every number has the same chance of being selected), each
1
integer from 1 to 20 has a probability of 20
to be selected, so the probability of the
1
57th person being selected is also 20 .
6. A new machine is invented to detect COVID-19. A clinical study is done with a
random sample of size 10000. From the study, both the sensitivity and specificity of
the machine are calculated to be 90%. Which of the following statements must be
correct?
(I) Among those with COVID-19 in the sample, 10% tested negative.
(II) Among those who tested negative in the sample, 10% have COVID-19.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (A). Statement (I) is correct. Sensitivity = 90% means that among those
with COVID-19, 90% are tested positive. Thus among those with COVID-19, 10%
are tested negative.
Statement (II) may not be correct. Specificity of the test is P(Tested negative | Without COVID-19) = 90%. We are unable to deduce further with the limited information,
if exactly 10% of those who tested negative in the sample, have COVID-19. It depends
on the base rate P(COVID-19).
7. There are 3 families X, Y and Z. The families have 2 children each. Family X has 1
boy and 1 girl. Family Y has only 2 girls and Family Z has only 2 boys. 1 child is
randomly selected among the 6 children. If the selected child is a boy, what is the
probability that he is from family X?
(A) 0.
(B) 16 .
(C) 13 .
(D) 12 .
Answer is (C). The required probability is the conditional probability
P (child from family X | child is a boy) =
3
1
P (child is a boy from family X)
1
= 63 = .
P (child is a boy)
3
6
8. Every resident in a town undergoes a new COVID test. The sensitivity and specificity
of the test are 90% and 99% respectively. Among those who tested positive, 60% are
actually infected with the virus. Which of the following is closest to the base rate for
COVID infection in this town?
(A) 60%.
(B) 16%.
(C) 1.6%.
(D) 0.16%.
Answer is (C). Suppose there are 1000 residents tested positive. By the information
provided, 600 are infected with the COVID virus and 400 are not.
Infected
Not infected
Column total
Positive test
600
400
1000
Negative test
?
?
?
Row total
?
?
?
As the sensitivity of the test is 90%, this means that
P (Positive test | Infected) = 0.9 =
600
,
x
where x is the total number of residents infected with COVID. This gives us x = 667
(rounded to nearest integer).
As the specificity of the test is 99%, this means that
P (Negative test | Not infected) = 0.99 =
y − 400
,
y
where y is the total number of residents not infected with COVID. This gives us
y = 40000.
Since the total number of residents is 667 + 40000 = 40667, the base rate for COVID
667
infection in the town is 40667
× 100% which is approximately 1.6%.
9. For two events A and B, P (A) = 0.5 and P (B) = 0.2. Which of the following
statements is/are always true? Select all that apply.
(A) P (A or B) = 0.7 if A and B are independent events.
(B) P (A or B) = 0.7 if A and B are mutually exclusive events.
(C) P (A and B) = 0.1 if A and B are independent events.
(D) P (A and B) = 0.1 if A and B are mutually exclusive events.
(B) and (C) are correct. For mutually exclusive events, P (A and B) = 0 and P (A or B) =
P (A) + P (B) = 0.7. For independent events, P (A and B) = P (A) × P (B) = 0.1 and
P (A or B) = P (A) + P (B) − P (A and B) = 0.6 (or derived via the intuitive Venn
diagram shown below).
4
10. The probability of flipping a biased coin and getting two tails in a row is 0.16. Each
flip is independent of the other. The probability of flipping that same coin and getting
two heads in a row is
.
Fill in the blank above. Give your answer correct to 2 decimal places.
The answer is 0.36. Each flip is√ independent of the other, which means that the
probability of getting one tail is 0.16 = 0.4. The probability of getting one head is
then 1 − 0.4 = 0.6. Getting two heads in a row will be 0.6 × 0.6 = 0.36.
11. An airline is conducting a lucky draw event on 10 July 2023. It will first make a list of
all its 50 flights on that day and then randomly selects one flight from the list. After
that, 10 passengers will be randomly chosen from the selected flight to receive lucky
draw prizes. It is confirmed that there will be 120 passengers on each of the 50 flights
that day. Suppose that Amy will be on exactly one of the flights on 10 July 2023. The
˙
probability of Amy receiving a lucky draw prize from the airline is
(Give
your answer in percentage correct to 2 decimal places.)
Answer is 0.17%.
P (Amy receiving a lucky draw prize)
=
P (Amy is on the selected flight AND she is one of the 10 chosen passengers on the flight)
=
1/50 × 10/120 = 0.17%(correct to 2 decimal places.)
12. Camera lenses are made by two companies, A and B. Suppose that 60% of all lenses
made are by Company A, while the rest are made by Company B. 7% of all lenses made
by Company A are faulty, while 9% of all lenses made by Company B are faulty. One
of the lenses was selected at random and observed to be faulty. What is the probability
that it was produced by Company A? Give your answer as a number between 0 and
1, to the nearest 2 decimal places.
Answer is 0.54. P(A | faulty) = P(A and faulty)/P(faulty). While P(A and faulty) =
P(faulty | A) × P(A) = 0.07 × 0.6 = 0.042, we use the Law of Total Probability to
compute P(faulty). That is, P(faulty) = P(faulty | A) × P(A) + P(faulty | B) × P(B)
5
= 0.07×0.6+0.09×0.4 = 0.078. Therefore, P(A | faulty) = P(A and faulty)/P(faulty)
= 0.042/0.078 = 0.54.
13. A poll conducted before a local election gives a 95% confidence interval for the percentage of voters who support candidate X as (54%, 60%). Based on the same poll result,
which of the following can potentially be a 99% confidence interval for the percentage
of voters who support candidate X?
(A) (56%, 58%).
(B) (52%, 62%).
(C) (54%, 60%).
Answer is (B). A 99% confidence interval should be wider than the 95% confidence
interval for the same sample.
14. Based on a random sample of 200 staff members in NUS, the 95% confidence interval
for the proportion of all NUS staff who went on vacation for at least 5 days in 2018 is
(0.33, 0.59). Which of the following statements must be true?
(I) If another sample of size 500 is drawn using the same sampling method, for the
same confidence level, the confidence interval will be wider than (0.33, 0.59).
(II) A maximum of 59% of all NUS staff went on vacation for at least 5 days in 2018.
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (C). Statement (I) is false since a bigger sample size will likely result in a
narrower confidence interval for the same confidence level. Statement (II) need not be
true as the interval (0.33, 0.59) is not guaranteed to contain the population proportion.
15. A researcher takes a random sample from Country X’s population to estimate its
unemployment rate. From the sample, the researcher obtains the 95% confidence
interval for the population unemployment rate to be between 0.18 and 0.22.
Which of the following statements correctly interprets the results?
(A) If many samples of the same size were collected using the same procedure, and
their respective confidence intervals calculated in the same way, about 95% of
these samples will have the sample unemployment rate lie between 0.18 and 0.22.
(B) If many samples of the same size were collected using the same procedure, and
their respective confidence intervals calculated in the same way, about 95% of
these samples will have the population unemployment rate lie between 0.18 and
0.22.
(C) If many samples of the same size were collected using the same procedure, and
their respective confidence intervals calculated in the same way, about 95% of
these samples will have the sample unemployment rate lie within the samples’
respective confidence intervals.
(D) If many samples of the same size were collected using the same procedure, and
their respective confidence intervals calculated in the same way, about 95% of
these samples will have the population unemployment rate lie within the samples’
respective confidence intervals.
6
Answer is (D). By the interpretation of confidence intervals via repeated sampling, we
conclude that about 95% of the samples will contain the population parameter (in this
case the population unemployment rate) within their respective confidence intervals.
16. A nutritionist wants to estimate the average mass of a particular tortilla chip brand,
Naritos. He decides to collect a sample of 100 packets of Naritos chips, using the
method described as follows: out of all the factories that produce chips, he goes down
to the nearest factory that produces Naritos chips on a particular day and measures
the average mass of the first 100 packets of Naritos chips produced on that day. You
can assume the mass of the packaging is negligible.
Using the formula
s
x ± t∗ × √ ,
n
he obtains a 95% confidence interval of [676.3, 723.7] for the population mean. Which
of the following statements must be true?
(I) If we collect 100 samples of 100 packets of Naritos chips using the same sampling
method, about 95% of the samples will result in confidence intervals containing
the population average mass of Naritos chips.
(II) We can conclude that there is a 95% chance that the population average mass of
Naritos chips lies within [676.3, 723.7].
(A) Only (I).
(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).
Answer is (C). Recall that
Sample statistic = population parameter + bias + random error.
For a confidence interval to be valid, we need the assumption that bias is approximately
zero, so that
Sample statistic = population parameter + random error.
However, in this scenario, a probability sampling method is not used to acquire the
sample, so selection bias will interfere with the sample estimate of the population
mean mass of Naritos chips.
It is possible that many samples out of the 100 samples collected have a sample
mean that overstates the true population mean by a large amount. For instance, it is
possible that the first 100 packets of chips produced each day are a lot heavier than the
average packet of chips. Thus we do not expect to see 95 out of 100 of the confidence
intervals generated containing the true population mean. Hence, statement (I) is false.
Furthermore, we do not use the word “chance” that the population mean lies within
a given range. Thus statement (II) is also false.
17. A random sample of size 500 is taken from a population of 10000 people of age 50. From
the sample, a 95% confidence interval for the population mean weight is constructed.
Which of the following statements is/are correct?
(I) The confidence interval will always contain the sample mean weight.
(II) If many samples of the same size are collected using the same sampling method,
about 5% of the confidence intervals from these samples will not contain the
population mean weight.
7
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (C). A confidence interval contructed from a sample will always contain
the sample mean (while we cannot be sure if it will contain the population mean).
For statement (II), the interpretation of confidence intervals via repeated sampling
tells us that about 95% of the confidence intervals from these samples will contain the
population mean, so about 5% of the intervals will not contain the population mean.
18. A 99% confidence interval for the mean height (in meters) of NUS students is [1.58, 1.80].
It is constructed using a random sample of 100 students. Using the same sample, which
of the following is a plausible 95% confidence interval for the mean height?
(A) [1.64, 1.86].
(B) [1.61, 1.77].
(C) [1.48, 1.89].
(D) [1.63, 1.85].
Answer is (B). From the 99% confidence interval [1.58, 1.80], we can infer that the
= 1.69 and the margin of error is (1.80−1.58)
= 0.11.
sample mean is (1.80+1.58)
2
2
As we use the same sample to calculate the 95% confidence interval, the sample mean
remains the same but the margin of error decreases as the confidence level is lower.
Therefore, the lower bound of the 95% confidence interval should be larger than the
lower bound of the 99% confidence interval. Similarly, the upper bound of the 95%
confidence interval should be smaller than the upper bound of the 99% confidence
interval.
By checking the options given, [1.61, 1.77] is the only plausible answer.
19. A 95% confidence interval for the population mean income is constructed from a
random sample. Which of the following statements is/are true? Select all that apply.
(A) A smaller sample size is likely to result in a narrower 95% confidence interval.
(B) For the same sample, a 90% confidence interval will be narrower.
(C) There is a 95% chance that the population mean income will fall within the
constructed confidence interval.
(D) The confidence interval constructed contains the sample mean.
(B) and (D) are correct. (A) is not correct since a smaller sample size is likely to result
in a wider (not narrower) 95% confidence interval. (B) is correct since a 90% confidence
level is lower than a 95% confidence level, the confidence interval constructed will be
narrower than the one constructed at 95% confidence level. (C) is not correct because
we do not use the word ‘chance’ that the population mean lies within a given range. It
is either within the range or not. (D) is correct since a confidence interval constructed
from a sample will always contain the sample mean.
20. A researcher working for the government wishes to estimate the average daily expense
for food, for singles staying alone in HDB flats in a particular neighbourhood. He
manages to obtain a list of all singles living in the neighbourhood and uses simple
random sampling to pick 100 singles. The sample estimate obtained is $30.20 and
the standard deviation is $5.62. Calculate the margin of error, correct to 2 decimal
8
places, for a 95% confidence interval of the population parameter. The corresponding
t∗ value for the 95% confidence interval is 1.984.
Based on the formula for the margin of error:
s
t∗ × √ ,
n
the answer correct to 2 decimal places is 1.12.
21. Joseph wishes to know how everyone in his class fares for GEA1000 quiz 1. He sent a
survey to everyone in the class asking for their quiz 1 scores and everyone responded
truthfully. He then constructed a 95% confidence interval for the mean score with the
data collected and found that the interval is [4.4, 6.8].
Which of the following statements must be true?
(I) The constructed confidence interval may not contain the population mean score.
(II) The population mean score can be found given the constructed confidence interval.
(A) Only (I).
(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Answer is (B). Note that Joseph reached out to the population of interest with 100%
response rate. With a census, there is in fact no need to construct a confidence
interval since the computed mean would be the population mean and not a sample
mean. Nevertheless, if Joseph went ahead to construct a 95% confidence interval, the
confidence interval would contain the population mean and (I) is false. The population
mean is (4.4 + 6.8)/2 = 5.6 and therefore (II) is true.
22. Which of the following statements about the p-value is/are true? Select all that apply.
(A) The p-value is the probability of obtaining results at least as favourable to the
alternative hypothesis as the collected data, computed based on the assumption
that the null hypothesis is true.
(B) The p-value gives the probability that the null hypothesis is true.
(C) At a 5% significance level, a p-value smaller than 0.05 provides sufficient evidence
that the null hypothesis should be rejected.
(D) At a 5% significance level, a p-value larger than 0.05 provides sufficient evidence
that the null hypothesis should be rejected.
(A) and (C) are correct. In a hypothesis test, the null hypothesis is assumed to be true,
we neither prove nor provide evidence that the null hypothesis is true. A significance
level is a threshold we set, so that any p-value below it indicates there is sufficient
evidence to reject the null hypothesis.
23. Suppose we want to test if a coin is biased towards heads. We decide to toss the coin
10 times and record the number of heads. We shall assume the independence of coin
tosses, so that the 10 tosses constitute a probability experiment.
Let X denote the number of heads occurring in 10 tosses of the coin. We will carry
out a hypothesis test with X as the test statistic. Let H be the event that the coin
lands on heads, in a single toss. We set our hypotheses to be
9
ˆ H0 : P (H) = 0.5,
ˆ H1 : P (H) > 0.5.
Suppose in our execution of the 10 tosses, we observe 4 heads. This means X = 4 is
the test result we observe.
Recall the definition of p-value to be the probability of obtaining a test result at least
as extreme as the one observed, assuming the null hypothesis is true. What is the
range of test results “at least as extreme as the one observed”, in this scenario?
(A) 0 ≤ X ≤ 4.
(B) 4 ≤ X ≤ 10.
(C) 0 ≤ X ≤ 5.
(D) 5 ≤ X ≤ 10.
Answer is (B). Note that in the context of p-value computation, “at least as extreme”
is interpreted as “at least as favourable to the alternative hypothesis”. In this scenario,
the greater the value X assumes, the more favourable the case is to the alternative
hypothesis. Hence, the answer should be all values of X greater than or equal to the
observed value.
24. A group of students wants to find out if there is any association between staying in
a hall and being late for class in NUS in a particular month. If students are late for
at least 5 classes, they are considered “late for class” in that month. After collecting
a random sample of 1000 students, they found that 200 out of 350 students who stay
in a hall are late for class, while 390 out of 650 students who do not stay in a hall are
late for class.
A chi-squared test was done to test for association between staying in a hall and being
late for class at 5% level of significance. The p-value derived from the chi-squared test
is 0.3809.
Which of the following statements is/are true? Select all that apply.
(A) There is a positive association between staying in a hall and being late at the
sample level.
(B) There is a negative association between staying in a hall and being late at the
sample level.
(C) Since the p-value is more than 0.05, we can conclude that there is an association
between staying in a hall and being late at the population level.
(D) Since the p-value is more than 0.05, we cannot conclude that there is an association between staying in a hall and being late at the population level.
(B) and (D) are correct. To check for association between staying in hall and being
late at the sample level, we compare
rate(Late for class | Staying in hall) =
<
200
350
390
= rate(Late for class | Not staying in hall)
650
Hence, we see that staying in hall is negatively associated with being late at the
sample level. As we are doing a chi-squared test to see if there is an association at the
population level, recall that the null hypothesis states that there is no association at
the population level, while the alternative hypothesis states that there is an association
at the population level. Since the p-value is greater than the level of significance, we do
10
not reject the null hypothesis. Hence, we cannot conclude that there is an association
between staying in hall and being late at the population level.
25. Everyday, a food packaging line fills thousands of bags of sugar that are supposed to
weigh 100g each. One day, the line supervisor suspects that the machines might be
putting more sugar in each bag. He randomly selects some of the bags to perform a
hypothesis test. What should his null hypothesis be?
(A) Bags on average, weigh more than 100g.
(B) Bags on average, weigh less than 100g.
(C) Bags on average, weigh 100g.
Answer is (C). The null hypothesis expresses the idea that “things” are behaving as
expected, in this case, this means that the bags on average weigh 100g.
26. 25 mothers were each allowed to smell two articles of infant’s clothing. Each of them
was then asked to pick the one which belongs to her infant. They were successful in
doing so 72% of the time. You want to show that this has not happened by chance and
mothers can indeed recognise the smell of their children. To test such a hypothesis,
what should the null and alternative hypotheses be?
(A) H0 : P(Success)= 0.5; H1 : P(Success)> 0.5.
(B) H0 : P(Success)= 0.72; H1 : P(Success)> 0.72.
(C) H0 : P(Success)= 0.5; H1 : P(Success)< 0.5.
(D) H0 : P(Success)= 0.72; H1 : P(Success)< 0.72.
Answer is (A). To show something does not happen by chance, we should have the
null hypothesis implying that it happens by chance, and the alternative hypothesis
implying that it does not. This is so that we can gather evidence from a probability
experiment to reject the null for the alternative. In this case, the null hypothesis is
that the chance that a mother is able to recognise her child smell correctly is 50%.
The alternative hypothesis should be in line with what we want to show, hence the
alternative hypothesis is H1 : P(Success)> 0.5.
27. Jane suspects that sleep duration is associated with a child’s mathematical ability.
She conducts a study to test her hypothesis. Which of the following should be her
null hypothesis?
(A) There is insufficient evidence to conclude sleep duration is associated with a
child’s mathematical ability.
(B) Sleep duration is not associated with a child’s mathematical ability.
(C) Sleep duration is associated with a child’s mathematical ability.
(D) There is sufficient evidence to conclude sleep duration is associated with a child’s
mathematical ability.
Answer is (B). In hypothesis tests involving the association between two variables,
the null hypothesis expresses the idea that a variable is not associated with another
variable. Thus in this case, the null hypothesis asserts that sleep duration is not
associated with a child’s mathematical ability.
28. A student conducts a one-sample t-test, at a 5% level of significance, in the following
way:
Null hypothesis H0 : µ = 20.
Alternative hypothesis H1 : µ > 20.
11
Which of the following statements is true?
(A) If the level of significance was increased to 10%, the p-value will remain the same.
(B) If the p-value > 0.05, H0 should be rejected.
(C) If H1 was changed to µ < 20, the level of significance will decrease.
(D) If H0 was rejected at a 5% level of significance, H0 will also be rejected at a 1%
level of significance for the same data set.
Answer is (A). The level of significance is the ‘cutoff value’ decided by the researchers,
and hence does not impact the p-value in any way, nor should it change when the
alternative hypothesis changes. If H0 was rejected at a 5% level of significance, it will
also be rejected at a higher level of significance, not necessarily at a lower level of
significance. H0 should not be rejected if p-value ≥ 0.05.
29. Jasmine is a sports psychologist who wants to investigate if participation rates in
competitive sports amongst Austrian youths under 12 years old is associated with
sports funding to primary schools.
She collects her data from a random sample of 16-year-old Austrian students, in the
following format:
Subject
Gender
001
002
···
···
M
F
···
···
Did you participate in any
sports competition before you
were 12 years old? (Y/N)
Yes
Yes
···
···
Was there sports funding
in your primary school?
(Y/N)
No
Yes
···
···
Select the most appropriate null hypothesis in her study.
(A) There is no association between Gender and Participation in sports competition
before 12 years old among 16-year-old Austrian students.
(B) There is an association between Gender and Participation in sports competition
before 12 years old among 16-year-old Austrian students.
(C) There is no association between Participation before 12 years old and Sports
Fundings among 16-year-old Austrian students.
(D) There is an association between Participation before 12 years old and Sports
Fundings among 16-year-old Austrian students.
Answer is (C). The first paragraph tells us that the researcher is interested in finding the relationship between Participation before 12 years old and providing Sports
Fundings. So, (A) and (B) are not suitable null hypotheses. (D) is the alternative
hypothesis for this study.
30. Professor Ali, a sleep researcher, claims that the average amount of sleep that students
in school XYZ get each day is 7 hours. Professor Lily believes that it is less than that,
as her students are always sleeping in her classes. She decided to conduct a hypothesis
test at the 5% significance level. She randomly sampled 153 students from the school
and obtained a sample average sleep duration of 5.5 hours. Which of the following
outcomes must Professor Lily consider in calculating the p-value for her hypothesis
test?
(A) Sample average sleep duration of 7 hours.
(B) Sample average sleep duration smaller than 7 hours.
12
(C) Sample average sleep duration greater than 5.5 hours.
(D) Sample average sleep duration smaller than 5.5 hours.
Answer is (D). The null hypothesis here is that the population mean sleep duration
for students in school XYZ is 7 hours. The alternative hypothesis here is that the
population mean sleep duration for students in school XYZ is less than 7 hours. From
the sample, the test result observed is the sample average sleep duration of 5.5 hours.
Hence, test results that are as extreme or more extreme than what was observed are
sample averages of sleep duration that are smaller than or equal to 5.5 hours obtained
from random samples drawn.
13
Download