Uploaded by Heyram Ravi

Questions week 6

Nanyang Business School
AB1202 Statistics & Analysis
Tutorial: 6
Topics: Sampling & Central Limit Theorem
1. Random sampling requires the iid condition (independence and identical distribution of
sample points). Suppose I plan to conduct a study about the GPA (of an AY) of current NTU
students, as a measure of learning outcome. I am doing it at the end of an AY. Hence, each
student will have at least one record on file.
Following the steps of statistical inference, I assume the GPA scores follow a particular
distribution in the population (NTU students) and then proceed to collect a random
sample (sample size n=30). Note that, in this study, a sample point is a student-AY
combination.
(1) For this to be a random sample, what do the distributions of the 30 students in this
sample have to satisfy in the context of this study?
(2) For my own convenience, I randomly selected 10 Year-3 NBS students as the subjects
of this study. Taken together, they will provide at least 30 data points (Years 1 to 3).
How would this research design affect the validity of the independent and identical
distribution assumptions?
(3) If I randomly select 30 Year-3 NTU students and only get their Year-3 GPA scores as my
sample, what can be the issue from the perspective of iid?
2. Consider the “ChickWeight” dataset available in R.
(1) How would you use R to count the number of chickens in the sample whose weights are less
than 100g? (Hint: combine the “[]” method learned in W5 and the command
“length()”)
(2) What are the minimum, maximum, mean, median, standard deviation, and IQR of the
weight of those chickens that were fed on Diet #2? And for those on Diet #3?
(3) Generate a boxplot and a scatterplot for weights against time (weight on the vertical
axis and time on the horizontal axis).
3. “Seatbelts” contains the monthly road casualties in Great Britain from 1969 to 1984. Use
help(Seatbelts) and View(Seatbelts) to see more detail about this dataset.
SIDE NOTE: Unlike ChickWeight, Seatbelts is a time-series object (instead of a data frame).
In a time-series object, each row is linked to a specific timestamp. Use print(Seatbelts) or
plot(Seatbelts) to inspect it. To extract data from it, use the two-index method
Seatbelts[__, __], replacing each “__” as needed.
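The two-index method in the side note can be sketched as follows. This is only an illustration on columns that the questions below do not ask about ("rear", "kms" and "PetrolPrice" are columns of Seatbelts); the variable names and the 0.12 threshold are hypothetical choices made for this example.

```r
# Seatbelts ships with base R's datasets package, so no download is needed.
# Rows are months (January 1969 to December 1984); columns are named series.

# Row index and column name: the first 12 months of the "rear" column.
rear_first_year <- Seatbelts[1:12, "rear"]
length(rear_first_year)  # 12

# Leaving the row index blank keeps every month.
kms_all <- Seatbelts[, "kms"]
length(kms_all)  # 192

# A logical condition can also serve as the row index
# (0.12 is an arbitrary threshold chosen for illustration).
kms_high_petrol <- Seatbelts[Seatbelts[, "PetrolPrice"] > 0.12, "kms"]
```

The same pattern, with a different column name and row condition, applies to the questions that follow.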
(1) What are the sample mean and sample median of “drivers” (drivers killed or seriously
injured)?
(2) Did the sample mean & median increase or decrease after the law of compulsory
wearing of seat belts was introduced (check the “law” variable)?
(3) Calculate the sample mean and standard deviation of “drivers+front”.
4. Consider the following five-point summary that was obtained from a data set with 200
observations.
Min    1st Quartile (Q1)    Median    3rd Quartile (Q3)    Max
34     54                   66        78                   98
(1) Interpret Q1 and Q3.
(2) Calculate the interquartile range.
(3) Determine whether any outliers exist.
(4) If we want to construct a boxplot based on the information in the table, determine the
values of the lower limit and upper limit.
5. The figure below contains the boxplot of daily closing prices of major European stock
indices between 1991 and 1998 (> boxplot(EuStockMarkets)).
Complete the table below based on the boxplot:
               The index with the highest…    The index with the lowest…
Minimum
Maximum
Upper limit
Lower limit
IQR
6. R commands such as rbinom() and rnorm() generate samples from the underlying
distribution. For example, rbinom(2,5,.5) generates 2 values from a binomial
distribution that involves 5 trials and a success probability of 0.5. This command simulates
two possible outcomes from the above binomial experiment. But, how do computers do
this? After all, computers cannot flip coins or throw dice and computers can only follow
predetermined computer codes. How does a computer generate a “randomized” result?
In fact, computers, like robots, can only follow a prescribed set of instructions. Computer
scientists and mathematicians alike must define the steps to generate random numbers,[1]
which can be subsequently used to generate the values for random variables. Fortunately,
scientists have developed formulas to calculate different sequences of numbers that are
seemingly random, which are called pseudo-random numbers.
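To make the idea of a formula-driven sequence concrete, here is a minimal sketch of one classic recipe, a multiplicative congruential generator. It is an illustration only: it is not the generator R uses by default, and the function name lcg and the constants are choices made for this example.

```r
# Each new state is computed from the previous one by a fixed formula:
#   state <- (a * state) mod m
# Dividing by m rescales the state to a number strictly between 0 and 1.
lcg <- function(n, seed) {
  a <- 16807        # multiplier (illustrative, from the "minimal standard" generator)
  m <- 2147483647   # modulus, 2^31 - 1
  state <- seed
  out <- numeric(n)
  for (i in seq_len(n)) {
    state <- (a * state) %% m
    out[i] <- state / m
  }
  out
}

u1 <- lcg(5, seed = 100)
u2 <- lcg(5, seed = 100)
identical(u1, u2)  # TRUE: the same starting value reproduces the same sequence
```

The numbers look erratic, yet they are fully determined by the starting value, which is why such sequences are called pseudo-random.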
We can instruct R to use a particular sequence of pseudo-random numbers by setting the
seed value. The seed value controls the randomness of all subsequent commands that call
for randomization (like a seed). The R-command for doing that is set.seed(). You may
choose any integer value as the seed value. For example,
> set.seed(100)
> x = rnorm(5, 0, 1)
> x
[1] -0.50219235  0.13153117 -0.07891709  0.88678481  0.11697127
Now, if you run rnorm(5,0,1) (or rnorm(1,0,1) 5 times) after set.seed(100), you will obtain
the same five values for this normal distribution. Without setting the seed value
upfront, the same rnorm command would generate different values each time you use it
(as it is supposed to).
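The reproducibility described above can be checked directly. The sketch below uses the same seed as the example (100); the variable names are hypothetical and chosen only for this illustration.

```r
# Draw five standard normal values after setting the seed.
set.seed(100)
draws_a <- rnorm(5, 0, 1)

# Reset the seed and draw again: the sequence restarts from the same point.
set.seed(100)
draws_b <- rnorm(5, 0, 1)
identical(draws_a, draws_b)  # TRUE

# Drawing one value at a time, five times, walks through the same sequence.
set.seed(100)
draws_c <- numeric(5)
for (i in 1:5) {
  draws_c[i] <- rnorm(1, 0, 1)
}
identical(draws_a, draws_c)  # TRUE
```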
Please do the following steps:
(1) Without setting the seed value, draw 3 values from a standard normal distribution and
another 3 values from a uniform distribution between zero and one. Compare your
results with others’.
(2) Now set the seed to 50 and then repeat the above. Compare your results with others’.
(3) Now set the seed to 50 again and then draw 3 values from a uniform distribution
between zero and one and another 3 values from a standard normal distribution.
Compare the results with those from (2).
7. A for-loop lets us iterate (loop) through the elements in a vector and run the same code
on each element. The basic syntax for creating a for-loop statement in R is
for (value in vector) {
  statements
}
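As a small illustration of this syntax, the sketch below adds up the integers 1 to 10. The variable names total and k are arbitrary choices for this example.

```r
# Start an accumulator at zero, then add each element of 1:10 in turn.
total <- 0
for (k in 1:10) {
  total <- total + k
}
total  # 55
```

The accumulator must be created before the loop; inside the loop, each pass updates it using the current element.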
[1] A random number here is a draw from a discrete uniform distribution over a fixed
interval. For example, the result of a toss of a die is a random number from 1 to 6.
(1) Use the for-loop to calculate the sum of the first 100 squares:
1^2 + 2^2 + ⋯ + 100^2 = ?
(2) Set the seed to 100 prior to the for-loop. Generate a value for X_1 + X_2 + ⋯ + X_100,
where X_i follows U[0, i].
8. The graduation rate of a training program is estimated to be 30%. The program admits 10
trainees per month. Assume each trainee’s graduation probability is independent of
another’s.
(1) Suppose the admission system only keeps data from the most recent three months. If
we were to use the data to calculate the average number of graduates per month,
what would the distribution of the average be like? To answer this question, first think
about which distribution would be appropriate. Then, use a for-loop to generate 1000
sample mean observations. Finally, create a histogram of the distribution. For this
question, set the seed value to 1 prior to the for-loop.
(2) Repeat (1), but use 30 months instead of 3 months. Compare the two histograms and
identify the main difference.
(3) Following (2), if we approximate the sampling distribution by a normal distribution,
calculate the probability that the average is less than 2.5/month based on the normal
distribution. NOTE: the variance of a binomial RV = (# trials)*p*(1-p). Its expected
value = (# trials)*p.
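The NOTE above can be written out in general form. This is a sketch under stated assumptions: the monthly graduate counts X_1, …, X_m are independent, each binomial with n trials and success probability p. The symbols m, n and X̄ are introduced here for illustration and do not appear in the original question.

```latex
\begin{align*}
X_j &\sim \mathrm{Binomial}(n, p), \qquad E[X_j] = np, \qquad \mathrm{Var}(X_j) = np(1-p), \\
\bar{X} &= \frac{1}{m}\sum_{j=1}^{m} X_j, \qquad E[\bar{X}] = np, \qquad \mathrm{Var}(\bar{X}) = \frac{np(1-p)}{m}, \\
\bar{X} &\overset{\text{approx.}}{\sim} N\!\left(np,\ \frac{np(1-p)}{m}\right) \quad \text{for large } m.
\end{align*}
```

The last line is the Central Limit Theorem approximation: the mean stays at np, while the variance shrinks as the number of months m grows.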