Homework 1: Warm-up with Monte Carlo
GS 240: Data Science for Geoscience
Due date: Wednesday January 16, 1:30pm

This homework is intended as a warm-up on basic concepts in statistics and probability theory. You may already be familiar with the basic material, so I have included some new elements as well. This assignment will help you get familiar with these new elements. I will not cover them in class; however, with the help of your basic knowledge and online resources you will be able to complete the assignment successfully.

Some of the results of this homework will be used in Homework 2, so make sure that you have the correct results after you get back the graded assignment. If you made errors, you can redeem part of your grade by making the needed corrections.

I. Sampling from a known distribution using CDF inversion

Consider the log-normal distribution of a random variable 𝑋 with two parameters, the mean and variance. What is the relationship between the mean and variance and the log-mean and log-variance? The log-mean and log-variance are the mean and variance of log 𝑋, where 𝑋 has a log-normal distribution.

Write a program that generates a data set (does Monte Carlo) from the log-normal distribution, given 𝑁, the size of the data set, and the mean and variance of 𝑋. What method of sampling is used in your code?

Consider now a so-called log-hyperbolic density distribution with four parameters (φ, γ, δ, μ):

$$f_X(x) = \frac{\sqrt{\phi\gamma}}{\delta\,(\phi+\gamma)\,x\,K_1\!\left(\delta\sqrt{\phi\gamma}\right)} \exp\!\left\{-\tfrac{1}{2}(\phi+\gamma)\sqrt{\delta^2+(\log(x)-\mu)^2}+\tfrac{1}{2}(\phi-\gamma)(\log(x)-\mu)\right\}$$

$K_1$ is the modified Bessel function of the second kind. A Python program that samples from this distribution using the method of CDF (cumulative distribution function) inversion, given 𝑁 and the four parameters, will be provided. Please follow the instructions in https://github.com/lijingwang/GEOLSCI-240-ENERGY240/blob/master/hw1/tutorial_hw1.md and simply call the ‘log_hyperbolic_sampling’ function. With 𝑁 = 1000, it should finish in about 1 minute.

II. Exploratory data analysis

Generate a sample of 1000 outcomes for each of the following two cases:

Log-normal: log-mean = 1, log-standard deviation = 1
Log-hyperbolic: =2 = = =

For each data set, plot the log of the histogram (frequencies obtained by binning) vs the log of the sample values. Do you notice a difference?

Make a QQ-plot (quantile-quantile plot) between these two data sets. What do you notice?

III. Statistical convergence

Now you will study how some statistics of your datasets behave as you increase the sample size 𝑁.

For both the log-normal and the log-hyperbolic distribution, use Monte Carlo to generate a plot of the expected arithmetic average (empirical mean) of a dataset of size 𝑁 as a function of 𝑁. Because the simulations for the log-hyperbolic take time, use 𝑁 = 10, 20, 30, 50, 70, 100, 200, 500, 1000. In the same plot, also show the true 90% confidence interval of that arithmetic average as a function of sample size. Note that the 90% confidence interval is obtained simply by running many (𝐵) Monte Carlo simulations for fixed 𝑁 and then taking the 5% and 95% quantiles of the resulting set of 𝐵 values.

Make the same plot for the empirical variance.

Do this for the following parameter settings:

Log-normal, 2 cases: log-mean = 1, log-standard deviation = 1; log-mean = 1, log-standard deviation = 3
Log-hyperbolic, 2 cases: =2 = = = ; =0.3 = = =

Describe the results. What can you conclude from these plots?
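The Monte Carlo procedure described in Part III (repeat 𝐵 simulations at fixed 𝑁, then take the average and the 5% and 95% quantiles of the 𝐵 statistics) could be organized along the following lines. This is only a minimal sketch for the log-normal case, using the Python standard library. The names `quantile` and `convergence_table`, the choice 𝐵 = 500 (`n_rep`), and the fixed seed are assumptions made for illustration, not part of the assignment. Plotting is left out.

```python
import random

def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def convergence_table(log_mean, log_sd, sizes, n_rep=500, seed=0):
    """For each sample size N, run n_rep (B) Monte Carlo replicates and
    summarize the empirical mean and empirical variance across replicates."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    rows = []
    for n in sizes:
        means, variances = [], []
        for _ in range(n_rep):
            # lognormvariate takes the mean and std. dev. of log X
            sample = [rng.lognormvariate(log_mean, log_sd) for _ in range(n)]
            m = sum(sample) / n
            v = sum((x - m) ** 2 for x in sample) / (n - 1)
            means.append(m)
            variances.append(v)
        means.sort()
        variances.sort()
        rows.append({
            "N": n,
            "mean_avg": sum(means) / n_rep,
            "mean_q05": quantile(means, 0.05),
            "mean_q95": quantile(means, 0.95),
            "var_avg": sum(variances) / n_rep,
            "var_q05": quantile(variances, 0.05),
            "var_q95": quantile(variances, 0.95),
        })
    return rows

# Sample sizes listed in the assignment
SIZES = [10, 20, 30, 50, 70, 100, 200, 500, 1000]

if __name__ == "__main__":
    for row in convergence_table(1.0, 1.0, SIZES):
        print(row)
```

For the log-hyperbolic cases, the line that draws `sample` would instead call the provided `log_hyperbolic_sampling` function; its exact signature is given in the tutorial and is not assumed here.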