RAnalyticsCourse

advertisement
Data analytics using R for the Geosciences.
This short course aims to empower scientists and engineers to use modern methods of data analysis. A typical first
course in statistics focuses on the formal treatment of simple statistical methods, illustrating them with small data sets.
Moreover, skill with a data analysis language is not taught, and there is little experience with the more open-ended
exploration of complex data sets. This course complements a traditional introduction to statistics by developing analysis skills
with larger and more complex data sets and by building an appreciation for the value of more advanced statistical methods.
The framework for this approach is learning how to use the R statistical environment for data analysis. R is a publicly available
software package that is developed by the statistics community and is a standard for current statistical methodology. In
addition, it has the flexibility to support a range of users, from students learning statistics all the way to scientists and
engineers pursuing cutting-edge data analysis for research and commercial applications. This course will teach students how
to use R in a hands-on, tutorial environment and to explore substantial data sets in a way that motivates different statistical
methods. Although the data sets will largely be drawn from the environmental or Earth sciences, students will be encouraged to
consider data from other fields that match their interests.
Course outline.
The format for each day consists of new ideas and material presented in the morning session, with time in the afternoon
for students to work through examples and on more open-ended projects. Some work will be done in small teams to
encourage interaction among the class. The plan below is intended to run over 4.5 days.
R for data analysis

AM R Bootcamp: Manipulating data sets/ Summary statistics/ Basic graphics
Installing R, installing packages, loading packages (install vs. load)
Using the ‘etc’ folder to have R always load certain packages
Loading data into R using ‘read.csv()’ and loading data using Rcmdr
e.g. Controlled Burn data on vegetation (violet <- read.csv("violets.csv", header=TRUE))
- correct form for data: each column of the spreadsheet holds a different variable; the levels of
one variable should not be spread across separate columns
Subsetting data (subset violet for particular species)…in R script and Rcmdr
Creation of new variables…in R script and Rcmdr
Factors vs. Continuous variables…why we care
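The bootcamp steps above (reading a CSV file, subsetting, creating a variable, and marking a factor) can be sketched in a few lines of base R. This is only an illustration: the course file violets.csv is not reproduced here, so the sketch writes a small stand-in file, and the column names (species, burn, height_cm) and all values are assumptions.

```r
# Hypothetical stand-in for violets.csv: column names and values are invented
# for illustration and are not the course data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "species,burn,height_cm",
  "sororia,yes,12.1",
  "sororia,no,10.4",
  "pedata,yes,8.3",
  "pedata,no,7.9",
  "sororia,yes,13.0"
), csv_path)

# One variable per column, one observation per row.
violet <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Subset the rows for a particular species.
sororia <- subset(violet, species == "sororia")

# Create a new variable from an existing one.
violet$height_mm <- violet$height_cm * 10

# Mark burn as a factor (categorical); height stays continuous.
violet$burn <- factor(violet$burn)

str(violet)                                   # inspect the column types
tapply(violet$height_cm, violet$burn, mean)   # mean height per burn level
```

Declaring burn as a factor is what lets functions such as tapply(), boxplot(), and lm() treat it as a grouping variable rather than as text.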
Project: Give a data set (say NC bassfishing data…bassfish.csv) and ask folks, in groups of 3, to answer
some questions…maybe have each group do questions 1-3 and then assign each group a more difficult
question such as those given in 4-6
After reading data into R, do the following:
1. What is the average age of female bass fishermen in NC? What is the standard deviation of the female ages?
2. Does the variable in (1) have a Normal distribution? Why do you think so/not?
3. Using side-by-side boxplots, determine whether males are willing to pay more than females for fishing costs.
4. Argue whether “willingness-to-pay” depends on education level.
5. Argue whether there is/is not a greater proportion of married fishermen in NC than in SC.
6. Argue whether or not “willingness to pay” depends on how successful the person was at fishing last year.
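One possible way to approach questions 1 to 3 is sketched below. The real bassfish.csv is not shown in this outline, so the data frame is simulated; the column names (sex, age, wtp) and the distributions used to generate them are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated stand-in for bassfish.csv; names and values are hypothetical.
bass <- data.frame(
  sex = factor(rep(c("F", "M"), times = c(40, 60))),
  age = round(c(rnorm(40, mean = 42, sd = 10), rnorm(60, mean = 45, sd = 12))),
  wtp = c(rlnorm(40, meanlog = 3.0, sdlog = 0.5),
          rlnorm(60, meanlog = 3.2, sdlog = 0.5))
)

# Question 1: mean and standard deviation of the female ages.
female_age <- bass$age[bass$sex == "F"]
mean(female_age)
sd(female_age)

# Question 2: a Normal quantile plot, plus a formal test as a complement.
qqnorm(female_age)
qqline(female_age)
shapiro.test(female_age)

# Question 3: side-by-side boxplots of willingness to pay by sex.
boxplot(wtp ~ sex, data = bass, ylab = "Willingness to pay")
```

The plots carry most of the argument here; the Shapiro-Wilk test is an optional extra and should be read alongside the quantile plot rather than instead of it.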

PM Simple programming in R: Building plots/writing functions/ statistical distributions
Basics of writing a function: functions in R are called with ‘( )’
Notion of arguments in a function
Every function ‘returns’ something(s) and this needs to be part of the code
Show how to write a simple function in which the mean, median, and standard
deviation (or whatever statistics we want) are computed for a single column of data
Use the function on a column of data from the bassfish.csv data set…check our values from the function
with “Numerical Summaries” approach in Rcmdr.
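A minimal sketch of such a function follows. The function name col_summary, the na.rm argument, and the example values are illustrative choices, not part of the course material.

```r
# Return the mean, median, and standard deviation of one numeric column.
# `x` is the data argument; `na.rm` controls whether missing values are dropped.
col_summary <- function(x, na.rm = TRUE) {
  stopifnot(is.numeric(x))
  out <- c(
    mean   = mean(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm),
    sd     = sd(x, na.rm = na.rm)
  )
  return(out)   # the value handed back to the caller
}

ages <- c(23, 35, 41, 41, 58, NA)   # made-up values
res <- col_summary(ages)
res
```

With a data frame in hand, the same call would be applied to one column, for example col_summary(bass$age) if the data were stored in an object named bass with a column named age (both names hypothetical).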
Idea of bootstrapping…Rock Wren Data from New Zealand
Project 2: Take a data set from a stratified random sample and compute a bootstrap confidence interval for the median
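The bootstrap idea can be sketched as follows, using a percentile interval and resampling within each stratum. The data are simulated (the Rock Wren data are not reproduced here), the column names are hypothetical, and the sketch assumes proportional allocation so that the pooled median needs no weighting; a real stratified design may call for weights.

```r
set.seed(42)

# Made-up stratified sample: two strata, proportional allocation assumed.
wren <- data.frame(
  stratum = rep(c("A", "B"), times = c(30, 20)),
  count   = c(rpois(30, lambda = 4), rpois(20, lambda = 9))
)

# One bootstrap replicate: resample rows with replacement inside each
# stratum, then take the median of the pooled resample.
boot_median <- function(data) {
  rows_by_stratum <- split(seq_len(nrow(data)), data$stratum)
  idx <- unlist(lapply(rows_by_stratum, function(i) {
    i[sample.int(length(i), replace = TRUE)]
  }))
  median(data$count[idx])
}

B <- 2000                                   # number of bootstrap replicates
meds <- replicate(B, boot_median(wren))

# Percentile 95% confidence interval for the median.
ci <- quantile(meds, probs = c(0.025, 0.975))
ci
```

Resampling within strata keeps each stratum's sample size fixed in every replicate, which mirrors how the original sample was drawn.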

Concept: Sample vs. population
Relationships between variables

AM: Correlation, linear regression and least squares fitting
Vampire device data from “kill-a-watt” to motivate use of linear regression
Show something with quadratic relationship but still linear model
Show something with a weird pattern that we would use a nonparametric smooth for (ex. loess)
Show scatterplot…use plot to ‘prescribe’ a functional form
Idea of least squares
Idea of Normality
prediction equation
notion of MSE and maybe relate this back to ‘s’ in the univariate setting.
Project 3: Give a data set and get them to estimate the parameters of the model, estimate the mean for a
given value of the covariate, get a prediction interval and a confidence interval, and interpret both of these.
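The steps in Project 3 can be sketched with lm() and predict(). The data below are simulated as a stand-in for the kill-a-watt measurements; the variable names (hours, kwh) and the generating values are assumptions for illustration only.

```r
set.seed(7)

# Simulated stand-in for the kill-a-watt data; names and values are hypothetical.
hours <- runif(40, min = 0, max = 24)
kwh   <- 0.05 + 0.012 * hours + rnorm(40, sd = 0.02)
watt  <- data.frame(hours, kwh)

# Least squares fit of a straight line.
fit <- lm(kwh ~ hours, data = watt)
coef(fit)                       # estimated intercept and slope

# MSE is the squared residual standard error ('s' in the regression setting).
mse <- summary(fit)$sigma^2

# Estimated mean and intervals at a chosen covariate value.
new_obs  <- data.frame(hours = 12)
conf_int <- predict(fit, new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, new_obs, interval = "prediction", level = 0.95)
conf_int
pred_int

# A quadratic term is still a linear model: it is linear in the coefficients.
fit_quad <- lm(kwh ~ hours + I(hours^2), data = watt)
```

The confidence interval describes uncertainty in the mean response at hours = 12, while the prediction interval also includes the scatter of a single new observation, so it is always the wider of the two.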

PM: The regression zoo: Count data, a nonlinear response, smooth curves
Show non-linear response (maybe a growth curve data set)
Show binary response
Show count data (horseshoe crab satellite data)
Show log-normal data (spotfin chub catch per unit effort)
Project 4: Give binary response data set…maybe resource selection with presence/absence of wolves
(I have a data set that we could use) in grid cells and “road density” as covariate…get them to come up
with prediction equation and brainstorm on how to use this equation to produce a resource selection
map. Give count data and get them to come up with prediction equation.
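A sketch of how the two prediction equations in Project 4 might be obtained with glm() is given below. The wolf data set mentioned above is not available here, so presence/absence and counts are simulated; the column names (road_density, presence, count) and the coefficients used to generate the data are assumptions.

```r
set.seed(3)

# Simulated grid cells; names and generating values are hypothetical.
n <- 200
road_density <- runif(n, min = 0, max = 2)
cells <- data.frame(
  road_density = road_density,
  presence     = rbinom(n, size = 1, prob = plogis(1.5 - 2.5 * road_density)),
  count        = rpois(n, lambda = exp(1 - 0.8 * road_density))
)

# Binary response: logistic regression.
fit_bin <- glm(presence ~ road_density, family = binomial, data = cells)

# Count response: Poisson regression with a log link.
fit_pois <- glm(count ~ road_density, family = poisson, data = cells)

# Prediction equations applied to new cells. Evaluating p_hat over every
# cell of a grid is one way to build a resource selection map.
new_cells <- data.frame(road_density = c(0.2, 1.0, 1.8))
p_hat  <- predict(fit_bin,  new_cells, type = "response")  # P(presence)
mu_hat <- predict(fit_pois, new_cells, type = "response")  # expected count

# The same probabilities written out from the fitted coefficients.
b <- coef(fit_bin)
p_by_hand <- plogis(b[1] + b[2] * new_cells$road_density)
```

Writing the prediction out by hand, as in the last two lines, makes the link function explicit: the linear predictor is on the logit scale and plogis() maps it back to a probability.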

Concept: Correlation vs. causation
Several variables

AM Correlation revisited: Multivariate data and clustering.

PM Data collected over time

Concept: Curse of dimensionality
Geophysical data

AM Visualizing spatial data and fitting surfaces.

PM Climate and weather: Data collected over both time and space.

Concept: Uncertainty of a prediction
Wrap up

AM Project reports and discussion.