FROM N00B TO PRO Jack Davis Andrew Henrey

advertisement
FROM N00B TO PRO
Jack Davis
Andrew Henrey
PURPOSE
Create a simulator from scratch that:
•
Generates data from a variety of distributions
•
Makes a response variable from a known function of the data (plus an error term)
•
Constructs a linear model that estimates the coefficients of the function
•
Repeats generation and modeling many times to compare the average estimates of the
linear model to the known parameters.
•
Package the whole thing nicely into a function that we can call in a single line in later
work.
•
If you’re experienced, the commands themselves may seem trivial
OUTLINE
•
1) Learning how to learn
•
2) Randomly Generating Data
•
3) Data Frames and Manipulation
•
4) Linear Models
•
BREAK – Quality of presenter improves
•
5) Running loops
•
6) Function Definition
•
7) More advanced function topics
•
8) Using functions
•
9) A short simulation study
LEARNING HOW TO LEARN – JACK DAVIS
•
Google CRAN Packages to get the package list
• From here you can get a description of every command in a package.
•
??<term> searches for commands related to <term>
• ??plot will find commands related to plot
•
?<command> calls up the help file for that command
• ?abline gives the help file for the abline() command.
LEARNING HOW TO LEARN – JACK DAVIS
•
Exercises:
•
Name one function in the darts game package.
•
What is the e-mail of the author of the Texas Holdem simulation package?
•
(Bonus) Tell the author about your day via e-mail; s/he likes hearing from fans.
•
Find a function to make a histogram
•
Find some example code on the heatmap() command.
RANDOMLY GENERATING DATA – JACK DAVIS
•
The r<dist> commands randomly generate data from a distribution
•
rnorm( n , mean, sd)
•
rexp( n, rate)
•
rbinom( n, size, prob)
•
rt( n, df) From Student’s T. (Mean is zero, so setting a mean is up to you)
•
set.seed() Allows you to generate the same data every time, so you or others can
verify work.
Generates from normal distribution (default N(0,1))
RANDOMLY GENERATING DATA – JACK DAVIS
•
Set a random seed
•
Generate a vector of 50 values from the Normal (mean=10,sd=4) distribution, name
the vector x1.
•
Do the same with
• Poisson ( lambda = 5), named x2,
• Exponential (rate = 1/7) named x3,
• Student’s t distribution (df =5), with a mean of 5, named x4,
• Normal (mean=0, sd=20), named err
•
Make a new variable y, let it be 3 + 20x1 + 15x2 – 12x3 – 10x4 + err
DATA FRAMES – JACK DAVIS
•
data.frame() makes a dataframe object of the vectors listed in the ()
•
The advantage of having a data frame is that it can be treated as a single object..
•
Data frames, models, and even matrix decompositions can be objects in R.
•
You can call parts of objects by name using $
•
model$coef or model$coefficient will bring up the estimated coefficients
•
If no such aspect exists, then you’ll get a null response.
•
Example: Andrew$height
DATA FRAMES – JACK DAVIS
•
Exercises:
•
Make a data.frame() of x1,x2,x3,x4, and y Name it dat
•
(if you’re stuck from the last part, run “Q3-dataframethis.txt” first)
•
Use index indicators like dat[4,3], dat [2:7,3], dat [4,], and dat [4,-1] to get
• The 3rd row, 5th entry of dat
• The 2nd – 7th values of the 5 th column
• The entire 3rd row
• The 3rd row without the 1st entry
LINEAR MODELS – JACK DAVIS
•
The results of the lm() function are an object.
•
Example: mod = lm(y ~ x1 + I(x2^2) + x1:x2, data=dat)
•
Useful aspects
• mod$fitted
• mod$residuals
•
Useful functions
• summary(mod)
• predict(mod, newdata)
LINEAR MODELS – JACK DAVIS
•
Use the lm command to create a linear model of y as a function of x1,x2,x3, and x4
additively using dat data, name it mod. (No interactions or transformations)
•
Get the summary of mod
•
Display the estimated coefficients with no other values.
BREAK
•
This slide unintentionally left 98% blank
OUTLINE
•
•
•
•
•
BREAK – Quality of presenter improves deteriorates
•
5) Running loops
•
6) Function Definition
•
7) More advanced function topics
•
8) Using functions
•
9) A short simulation study
RUNNING YOU FOR A LOOP – ANDREW HENREY
•
Similar to other programming languages, loops in R allow you to repeat the same block of
code several times
•
Unlike other programming languages, large loops in R are exceedingly slow
•
Any loop of less than about 100,000 total iterations is not going to give you much trouble
in terms of time
RUNNING YOU FOR A LOOP – ANDREW HENREY
•
An R loop that executes a million commands takes about a second. Conditions vary wildly
•
Generating 100,000 data sets of size 50,000 and looping through the dataset to calculate
a mean for each one would take longer to run than Jack Davis heading up Burnaby
Mountain (ouch)
Sup d00dz late for tutorial 
Jack
RUNNING YOU FOR A LOOP – ANDREW HENREY
•
Loop syntax:
for (i in 1:n)
{
#TellVicEverything
}
RUNNING YOU FOR A LOOP – ANDREW HENREY
•
No need to run from 1:K
•
Can use an arbitrary vector instead
•
Runs for length(vect) iterations
•
Takes on the i th value of the vector each iteration
•
e.g.
•
V = c(1,5,3,-6)
•
for (count in V) {print(count);}
•
## 1, 5 , 3 , -6
RUNNING YOU FOR A LOOP – ANDREW HENREY
•
Exercises:
•
A) Define a variable runs to be the number 10,000
•
B) Define a matrix() called mat with 5 columns and runs rows (10,000 rows)
•
C) Put a for() loop around the code found in q5-loopthis.txt . Loop from 1:runs.
Use index indicators like a[k,] to save the estimated coefficients of the model in a
new row of mat.
OR, if you think you are a total coding BOSS, then put the loop around your code in
parts 2-4 that generates data and finds the linear model estimates of the betas.
FUNCTION DEFINITION – ANDREW HENREY
•
Functions are a slightly abstract concept
•
Mathematics: f(x) = x 2+4x-16
•
Computing: mean(x) = sum(x)/length(x) – All 3 are functions!
•
Functions map INPUTS to OUTPUT
•
Possibly no inputs
•
In one way or another, always some form of output
•
Example functions:
•
SORT, MEDIAN, OPTIM / NLM, LM/GLM
FUNCTION DEFINITION – ANDREW HENREY
•
Function syntax
•
Simple function:
F = function()
{
return (5)
}
>> F()
5
FUNCTION DEFINITION – ANDREW HENREY
•
Exercises:
•
Make a function out of the code you wrote in part 5. The syntax should be similar to the
previous slide. The function:
• Should be called simulate.lm
• Should include everything needed to generate the data several times, find a linear
model, and extract the coefficients
• Does NOT take any inputs
• Should return the matrix of 10,000 runs of coefficients
•
Use the function and save the results to a matrix called test
•
If nothing is working (), you can use the example code in q6 – function this.txt
ADVANCED FUNCTIONS – ANDREW HENREY
•
A more complicated example:
MSE = function(X=c(0,3,11),Y)
{
return (mean((X-Y)^2))
}
•
Observe that X has default values
>>MSE(Y=c(4,5,6))
15
ADVANCED FUNCTIONS – ANDREW HENREY
•
If an input argument to a function has default values, you don’t have to specify them when
calling the function
•
If an input argument has no default values, running the function without specifying them
gives you an error
ADVANCED FUNCTIONS – ANDREW HENREY
•
Exercises:
•
Modify simulate.lm() by adding input parameters. Include:
•
nruns, the number of runs in the simulation, with no default
•
seed, the random initial seed of the simulation, defaulting to 1337
•
verbose, a Boolean true/false to report progress, defaulting to FALSE (caps matter)
•
Set runs to nruns at the beginning of your function
•
Use set.seed(seed) in your function
•
Add code that prints out how far along you are in the loop, but only when verbose is true
•
Run this new function to overwrite the old one
USING FUNCTIONS – ANDREW HENREY
•
Exercises:
•
A) Run your simulate.lm() with 25000 runs. Store these results as a variable
called betas
•
B) Use hist() on the first column of betas to see the sampling distribution of the
intercept
•
C) Use summary(), mean() , and sd() on this column as well
•
D) Use par(mfrow=c(2,2)) and then some hist() commands to display
histograms of the other four sampling distributions in a 2x2 grid
•
E) Compare your results to the known values (The means of the sampling distributions to
the true values, and the standard deviations to the estimated values in Q4)
SIMULATION STUDY – ANDREW HENREY
•
Idea: You have a binomial experiment with 9 successes and 3 failures.
•
You would like to construct a 95% CI for the true proportion of successes
•
You DON’T know whether the normal approximation is appropriate
•
How can we find out whether or not it’s OK?
SIMULATION STUDY – ANDREW HENREY
•
Overall procedure:
•
Construct a LOT of samples from a population with 12 trials and p=0.75
•
For each sample, calculate the 95% CI using the normal approximation
•
For each sample, see whether the CI overlaps with 0.75
•
Count the number of samples for which the CI overlaps with 0.75
•
The proportion of the samples that have a CI that overlaps is called the “true coverage
probability”
•
If the true coverage probability is close to 95% , the normal approximation to the sampling
distribution of p is a good one.
SIMULATION STUDY – ANDREW HENREY
•
Steps:
•
Generate 100,000 samples of binomial(12,0.75) data using rbinom()
•
For each sample, calculate the usual estimate of p
•
For each sample, calculate SE = sqrt(p*(1-p)/12)
•
For each sample, calculate the lower and upper bounds of the 95% CI
•
Find out how many intervals actually contain 0.75
•
Optional: Look at hist(x) to gain intuition of why the normal approximation isn’t
perfect
THE END
•
Leave plz tyvm
Download