Homework 1: Warm-up with Monte Carlo
GS 240: Data Science for Geoscience
Due date: Wednesday January 16, 1:30pm
This homework is intended as a warm-up on basic concepts in statistics and probability theory. You may
already be familiar with the basic material, so I have included some new elements as well. This assignment
will help you get familiar with these new elements. I will not cover them in class; however, with the help
of your basic knowledge and online resources you will be able to complete this successfully. Some of the
results of this homework will be used in Homework 2, so make sure that you have the correct results after
you get back the graded assignment. If you made errors, you can redeem part of your grade by making the
needed corrections.
I. Sampling from a known distribution using CDF inversion
Consider the lognormal distribution of a random variable 𝑋 with two parameters, the mean and variance.
What is the relationship between the mean and variance and the log-mean and log-variance? The log-mean
and log-variance are the mean and variance of log(𝑋), where 𝑋 has a lognormal distribution. Write a
program that generates a data set (does Monte Carlo) from the lognormal distribution, given 𝑁 (the size
of the data set) and the mean and variance of 𝑋. What method of sampling is used in your code?
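As a point of reference, here is a minimal sketch of one possible approach using inverse-CDF sampling; the function name sample_lognormal and the moment-matching step are illustrative choices only, and you are free to structure your own program differently.

```python
import numpy as np
from scipy.stats import norm

def sample_lognormal(N, mean, variance, rng=None):
    """Draw N samples of a lognormal X with the given mean and variance of X
    (not of log X), using inverse-CDF sampling."""
    rng = np.random.default_rng() if rng is None else rng
    # Moment matching: if log X ~ Normal(m, s^2), then
    #   E[X] = exp(m + s^2/2) and Var[X] = (exp(s^2) - 1) * exp(2m + s^2),
    # which inverts to the log-variance and log-mean below.
    log_var = np.log(1.0 + variance / mean**2)
    log_mean = np.log(mean) - 0.5 * log_var
    # Inverse-CDF step: U ~ Uniform(0, 1), then log X = m + s * Phi^{-1}(U).
    u = rng.uniform(size=N)
    return np.exp(log_mean + np.sqrt(log_var) * norm.ppf(u))

data = sample_lognormal(N=1000, mean=2.0, variance=4.0)
```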
Consider now a so-called log-hyperbolic distribution with four parameters (written here as φ, γ, δ, μ):
f_X(x) = \frac{1}{\left(\varphi^{-1}+\gamma^{-1}\right)\,\delta\,\sqrt{\varphi\gamma}\; x\, K_1\!\left(\delta\sqrt{\varphi\gamma}\right)}\,
\exp\!\left(-\frac{\varphi+\gamma}{2}\sqrt{\delta^{2}+\left(\log(x)-\mu\right)^{2}} + \frac{\varphi-\gamma}{2}\left(\log(x)-\mu\right)\right)
𝐾1 is the modified Bessel function of the second kind. A Python program that samples from this distribution
using the method of CDF (cumulative distribution function) inversion, given 𝑁 and the four parameters, will
be provided. Please follow the instructions in https://github.com/lijingwang/GEOLSCI-240-ENERGY240/blob/master/hw1/tutorial_hw1.md and simply call the ‘log_hyperbolic_sampling’ function. With 𝑁 =
1000, it should finish in around 1 minute.
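You should call the provided function rather than implement the sampler yourself, but to illustrate the idea behind CDF inversion, here is a rough grid-based sketch. It is not the provided log_hyperbolic_sampling implementation; the helper names, grid limits, and example parameter values are placeholders, and the Bessel-function constant is dropped because the CDF is normalized numerically.

```python
import numpy as np

def cdf_inversion_sampler(unnorm_pdf, x_grid, N, rng=None):
    """Generic grid-based CDF inversion: tabulate an (unnormalized) density,
    integrate it to a CDF with the trapezoidal rule, and push uniform draws
    through the numerical inverse CDF."""
    rng = np.random.default_rng() if rng is None else rng
    pdf = unnorm_pdf(x_grid)
    cdf = np.concatenate([[0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x_grid))])
    cdf /= cdf[-1]                       # numerical normalization (the K_1 constant cancels)
    u = rng.uniform(size=N)
    return np.interp(u, cdf, x_grid)     # inverse CDF by interpolation

def log_hyperbolic_unnorm(x, phi, gamma, delta, mu):
    """Unnormalized log-hyperbolic density in the (phi, gamma, delta, mu)
    parameterization written above."""
    r = np.log(x) - mu
    return np.exp(-0.5 * (phi + gamma) * np.sqrt(delta**2 + r**2)
                  + 0.5 * (phi - gamma) * r) / x

# Illustrative call with made-up parameter values (not the homework settings):
x_grid = np.geomspace(1e-6, 1e4, 200_000)
samples = cdf_inversion_sampler(lambda x: log_hyperbolic_unnorm(x, 2.0, 1.0, 1.0, 0.0),
                                x_grid, N=1000)
```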
II. Exploratory data analysis
Generate a sample of 1000 outcomes for each of the following two cases:
Log-normal: log-mean = 1, log-standard deviation = 1
Log-hyperbolic: =2 = = =
For each data set, plot the log of the histogram (frequencies obtained by binning) versus the log of the
sample values. Do you notice a difference?
Make a QQ-plot (quantile-quantile plot) between these two data sets. What do you notice?
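A possible sketch of these two plots is given below; the log-hyperbolic sample should come from the provided sampler (a placeholder is used here so the sketch runs on its own), and log-spaced bins are just one reasonable choice for the log-log histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lognormal_data = rng.lognormal(mean=1.0, sigma=1.0, size=1000)   # log-mean 1, log-std 1
# loghyp_data = ...  # 1000 outcomes from the provided log_hyperbolic_sampling function
loghyp_data = lognormal_data                                     # placeholder so the sketch runs

def loglog_histogram(data, n_bins=40, label=None):
    """Plot log(frequency) versus log(value) for a positive-valued sample."""
    edges = np.geomspace(data.min(), data.max(), n_bins + 1)     # log-spaced bins (one choice)
    counts, _ = np.histogram(data, bins=edges)
    centers = np.sqrt(edges[:-1] * edges[1:])                    # geometric bin centers
    keep = counts > 0                                            # log(0) is undefined
    plt.plot(np.log(centers[keep]), np.log(counts[keep]), "o-", label=label)

plt.figure()
loglog_histogram(lognormal_data, label="log-normal")
loglog_histogram(loghyp_data, label="log-hyperbolic")
plt.xlabel("log(sample value)"); plt.ylabel("log(frequency)"); plt.legend()

# QQ-plot: empirical quantiles of one data set against those of the other.
plt.figure()
q = np.linspace(0.01, 0.99, 99)
plt.plot(np.quantile(lognormal_data, q), np.quantile(loghyp_data, q), "o")
plt.xlabel("log-normal quantiles"); plt.ylabel("log-hyperbolic quantiles")
plt.show()
```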
III. Statistical convergence
Now you will study the behavior of some statistics of your data sets as you increase the sample size 𝑁.
For both the log-normal and the log-hyperbolic distribution, use Monte Carlo to generate a plot of the
expected arithmetic average (empirical mean) of a data set of size 𝑁 as a function of 𝑁. Because the
simulations for the log-hyperbolic take time, use 𝑁 = 10, 20, 30, 50, 70, 100, 200, 500, 1000. In that plot,
also plot the true 90% confidence interval of that arithmetic average as a function of the sample size. Note
that the 90% confidence interval is simply obtained by doing many (𝐵) Monte Carlo simulations for fixed 𝑁
and then taking the 5% and 95% quantiles of the sample set of 𝐵 values.
Make the same plot for the empirical variance.
Do this for the following parameter settings:
Log-normal, 2 cases: log-mean = 1, log-standard deviation = 1; log-mean = 1, log-standard deviation = 3
Log-hyperbolic, 2 cases: φ = 2, γ = , δ = , μ = ; and φ = 0.3, γ = , δ = , μ =
Describe the results. What can you conclude from these plots?
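One possible way to organize these convergence experiments is sketched below for the first log-normal case; the helper name convergence_bands, the choice 𝐵 = 500, and the plotting details are illustrative only. The log-hyperbolic cases reuse the same helper with the provided sampler, and the variance plot follows by passing stat=np.var.

```python
import numpy as np
import matplotlib.pyplot as plt

def convergence_bands(sampler, Ns, B=500, stat=np.mean):
    """For each sample size N, run B Monte Carlo replicates of sampler(N),
    evaluate the statistic on each replicate, and return its average value
    plus the 5% and 95% quantiles across the B replicates."""
    center, lo, hi = [], [], []
    for N in Ns:
        stats = np.array([stat(sampler(N)) for _ in range(B)])
        center.append(stats.mean())
        lo.append(np.quantile(stats, 0.05))
        hi.append(np.quantile(stats, 0.95))
    return np.array(center), np.array(lo), np.array(hi)

Ns = [10, 20, 30, 50, 70, 100, 200, 500, 1000]
rng = np.random.default_rng(0)

# Log-normal case with log-mean = 1, log-std = 1; swap in the provided
# log-hyperbolic sampler for the other cases.
mean_hat, lo, hi = convergence_bands(lambda N: rng.lognormal(1.0, 1.0, size=N), Ns)

plt.plot(Ns, mean_hat, "o-", label="expected empirical mean")
plt.fill_between(Ns, lo, hi, alpha=0.3, label="90% interval (5%-95% quantiles)")
plt.xscale("log"); plt.xlabel("N"); plt.ylabel("empirical mean"); plt.legend()
plt.show()
```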