Nanyang Business School
AB1202 Statistics & Analysis
Tutorial Topic 6: Sampling & Central Limit Theorem

1. Random sampling requires the iid condition (the sample points are independent and identically distributed). Suppose I plan to study the GPA (for an academic year, AY) of current NTU students as a measure of learning outcome. I conduct the study at the end of an AY, so each student has at least one record on file. Following the steps of statistical inference, I assume that GPA scores follow a particular distribution in the population (NTU students) and then collect a random sample of size n = 30. Note that, in this study, a sample point is a student-AY combination.
   (1) For this to be a random sample, what do the distributions of the 30 students in this sample have to satisfy in the context of this study?
   (2) For my own convenience, I randomly selected 10 Year-3 NBS students as the subjects of this study. Taken together, they will provide at least 30 data points (Years 1 to 3). How would this research design affect the validity of the independence and identical-distribution assumptions?
   (3) If I randomly select 30 Year-3 NTU students and use only their Year-3 GPA scores as my sample, what issue could arise from the perspective of iid?

2. Consider the “ChickWeight” dataset available in R.
   (1) How would you use R to count the number of chickens in the sample whose weights are less than 100 g? (Hint: combine the “[]” method learned in W5 with the command “length()”.)
   (2) What are the minimum, maximum, mean, median, standard deviation, and IQR of the weights of the chickens that were fed Diet #2? And of those fed Diet #3?
   (3) Generate a boxplot and a scatterplot of weight against time (weight on the vertical axis and time on the horizontal axis).

3. “Seatbelts” contains the monthly road casualties in Great Britain from 1969 to 1984. Use help(Seatbelts) and View(Seatbelts) to see more detail about this dataset.
   SIDE NOTE: Unlike ChickWeight, Seatbelts is a time-series object (instead of a data frame). In a time-series object, each row is linked to a specific timestamp. Use print(Seatbelts) or plot(Seatbelts) to display it. To extract data from it, use the two-index method Seatbelts[__ , __], replacing each “__” as needed.
   (1) What are the sample mean and sample median of “drivers” (drivers killed or seriously injured)?
   (2) Did the sample mean and median increase or decrease after the law on compulsory wearing of seat belts was introduced (check the “law” variable)?
   (3) Calculate the sample mean and standard deviation of “drivers + front”.

4. Consider the following five-point summary, obtained from a data set with 200 observations.

   Min    1st Quartile (Q1)    Median    3rd Quartile (Q3)    Max
   34     54                   66        78                   98

   (1) Interpret Q1 and Q3.
   (2) Calculate the interquartile range.
   (3) Determine whether any outliers exist.
   (4) If we want to construct a boxplot based on the information in the table, determine the values of the lower limit and the upper limit.

5. The figure below contains the boxplot of daily closing prices of major European stock indices between 1991 and 1998 (> boxplot(EuStockMarkets)). Complete the table below based on the boxplot:

                                 Minimum    Maximum    Upper limit    Lower limit    IQR
   The index with the highest…
   The index with the lowest…

6. R commands such as rbinom() and rnorm() generate samples from the underlying distribution. For example, rbinom(2,5,.5) generates 2 samples from a binomial distribution with 5 trials and a success probability of 0.5; that is, it simulates two possible outcomes of this binomial experiment. But how do computers do this? After all, computers cannot flip coins or throw dice; they can only follow predetermined code. How, then, does a computer generate a “randomized” result? In fact, computers, like robots, can only follow a prescribed set of instructions.
Computer scientists and mathematicians alike must define the steps to generate random numbers,¹ which can subsequently be used to generate the values of random variables. Fortunately, scientists have developed formulas that calculate sequences of numbers that appear random; these are called pseudo-random numbers. We can instruct R to use a particular sequence of pseudo-random numbers by setting the seed value. The seed value fixes the sequence used by all subsequent commands that call for randomization. The R command for doing that is set.seed(). You may choose any integer as the seed value. For example,

   > set.seed(100)
   > x = rnorm(5,0,1)
   > x
   -0.50219235 0.13153117 -0.07891709 0.88678481 0.11697127

Now, whenever you run rnorm(5,0,1) (or rnorm(1,0,1) 5 times) immediately after set.seed(100), you will obtain the same five values from this normal distribution. Without setting the seed upfront, the same rnorm command would generate different values each time you use it (as it is supposed to). Please do the following steps:
   (1) Without setting the seed value, draw 3 values from a standard normal distribution and another 3 values from a uniform distribution between zero and one. Compare your results with others’.
   (2) Now set the seed to 50 and then repeat the above. Compare your results with others’.
   (3) Now set the seed to 50 again and then draw 3 values from a uniform distribution between zero and one and another 3 values from a standard normal distribution. Compare the results with those from (2).

¹ A random number here means a draw from a discrete uniform distribution over a fixed range. For example, the result of tossing a die is a random number from 1 to 6.

7. A for-loop lets us loop through the elements of a vector and run the same code on each element. The basic syntax for creating a for-loop statement in R is

   for (value in vector) {
     statements
   }
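As a minimal sketch of the for-loop syntax in Question 7, the example below loops over a short vector and accumulates a running total. The names `values`, `v`, and `total` are illustrative assumptions, not part of the tutorial.

```r
# Illustrative vector (an assumption); any numeric vector would work here
values <- c(2, 4, 6)

# Running total, initialised before the loop starts
total <- 0

for (v in values) {
  # v takes each element of `values` in turn
  total <- total + v^2
}

print(total)  # 2^2 + 4^2 + 6^2 = 56
```

The accumulator is created outside the loop so that each pass adds to the value left by the previous pass.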
   (1) Use a for-loop to calculate the sum of the first 100 squares, $1^2 + 2^2 + \cdots + 100^2 = ?$
   (2) Set the seed to 100 prior to the for-loop. Generate a value for $X_1 + X_2 + \cdots + X_{100}$, where $X_i$ follows $U[0, i]$.

8. The graduation rate of a training program is estimated to be 30%. The program admits 10 trainees per month. Assume each trainee’s graduation probability is independent of another’s.
   (1) Suppose the admission system only keeps data from the most recent three months. If we were to use these data to calculate the average number of graduates per month, what would the distribution of the average look like? To answer this question, first think about which distribution would be appropriate. Then use a for-loop to generate 1000 observations of the sample mean. Finally, create a histogram of these observations. For this question, set the seed to 1 prior to the for-loop.
   (2) Repeat (1), but use 30 months instead of 3 months. Compare the two histograms and identify the main difference.
   (3) Following (2), if we approximate the sampling distribution by a normal distribution, calculate the probability that the average is less than 2.5 per month based on the normal distribution.
   NOTE: The variance of a binomial RV = (# trials)*p*(1-p). Its expected value = (# trials)*p.
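The formulas in the NOTE can be checked numerically. The sketch below uses illustrative values (20 trials, success probability 0.4) that are deliberately not those of Question 8, and compares the formulas with the mean and variance of simulated draws. The seed 123, the number of draws, and all variable names are arbitrary assumptions.

```r
# Illustrative parameters (assumptions, not the values in Question 8)
n_trials <- 20
p <- 0.4

# Theoretical moments of a binomial random variable
mu <- n_trials * p                 # expected value = (# trials) * p
sigma2 <- n_trials * p * (1 - p)   # variance = (# trials) * p * (1 - p)

# Simulated check: draw many binomial values and compare with the theory
set.seed(123)                      # arbitrary seed, for reproducibility only
draws <- rbinom(100000, size = n_trials, prob = p)

print(c(theory_mean = mu, simulated_mean = mean(draws)))
print(c(theory_var = sigma2, simulated_var = var(draws)))
```

With this many draws, the simulated mean and variance should land close to the theoretical values, though they will not match exactly.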