Data analytics using R for the Geosciences.

This short course has the goal of empowering scientists and engineers to use modern methods of data analysis. A typical first course in statistics focuses on the formal treatment of simple statistical methods, illustrating them with small data sets. Moreover, skill with a data analysis language is usually not taught, and students get little experience with the more open-ended exploration of complex data sets. This course complements a traditional introduction to statistics by developing analysis skills with larger and more complex data sets and by building an appreciation for the value of more advanced statistical methods. The framework for this approach is learning how to use the R statistical environment for data analysis. R is a publicly available software package developed by the statistics community and is a standard for current statistical methodology. In addition, it has the flexibility to support a range of users, from students learning statistics all the way through scientists and engineers pursuing cutting-edge data analysis for research and commercial applications. This course will teach students how to use R in a hands-on, tutorial environment and to explore substantial data sets in a way that motivates different statistical methods. Although the data sets will largely be drawn from the environmental and Earth sciences, students will be encouraged to consider data from other fields that match their interests.

Course outline.

The format for each day consists of new ideas and material presented in the morning session, with time in the afternoon for students to work through examples and on more open-ended projects. Some work will be done in small teams to encourage interaction among the class. The plan below is intended to run through 4.5 days.

R for data analysis

AM: R Bootcamp: Manipulating data sets / Summary statistics / Basic graphics
- Installing R, installing packages, loading packages (install vs. load)
- Using the 'etc' folder to have R always load certain packages
- Loading data into R using read.csv() and loading data using Rcmdr; ex. controlled-burn data on vegetation: violet <- read.csv("violets.csv", header=TRUE)
- Correct form for the data: each column of the spreadsheet is a different variable; levels of a variable should not be spread across separate columns
- Subsetting data (subset violet for a particular species), in an R script and in Rcmdr
- Creating new variables, in an R script and in Rcmdr
- Factors vs. continuous variables, and why we care

Project 1: Give a data set (say, NC bass-fishing data, bassfish.csv) and ask groups of 3 to answer some questions; perhaps have each group do questions 1-3 and then assign each group one of the more difficult questions 4-6. A short R sketch for getting started follows this list. After reading the data into R, do the following:
1. What is the average age of female bass fishermen in NC? What is the standard deviation of the female ages?
2. Does the variable in (1) have a Normal distribution? Why do you think so or not?
3. Using side-by-side boxplots, determine whether males are willing to pay more than females regarding fishing costs.
4. Argue whether "willingness to pay" depends on education level.
5. Argue whether there is or is not a greater proportion of married fishermen in NC than in SC.
6. Argue whether or not "willingness to pay" depends on how successful the person was at fishing last year.
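A minimal R sketch for getting started on Project 1. The column names used here (age, sex, wtp) are assumptions, not necessarily the names in bassfish.csv, and would need to be adjusted to match the actual file:

    # Read the bass-fishing data (column names below are hypothetical)
    bassfish <- read.csv("bassfish.csv", header = TRUE)

    # Question 1: mean and standard deviation of female ages
    females <- subset(bassfish, sex == "F")
    mean(females$age, na.rm = TRUE)
    sd(females$age, na.rm = TRUE)

    # Question 2: informal checks of Normality
    hist(females$age)
    qqnorm(females$age); qqline(females$age)

    # Question 3: side-by-side boxplots of willingness to pay by sex
    boxplot(wtp ~ sex, data = bassfish)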
PM: Simple programming in R: Building plots / writing functions / statistical distributions
- Basics of writing a function: functions in R are associated with '( )'
- Notion of the arguments of a function
- Every function returns something, and this needs to be part of the code
- Show how to write a simple function that computes the mean, median, and standard deviation (or whatever statistics we want) for a single column of data
- Use the function on a column of data from the bassfish.csv data set; check our values from the function against the "Numerical Summaries" approach in Rcmdr
- Idea of bootstrapping; Rock Wren data from New Zealand

Project 2: Take a data set from a stratified random sample and write a bootstrap confidence interval for the median (see the first sketch at the end of this outline).

Concept: Sample vs. population

Relationships between variables

AM: Correlation, linear regression, and least squares fitting
- "Vampire device" data from a "kill-a-watt" meter to motivate the use of linear regression
- Show something with a quadratic relationship that is still a linear model
- Show something with a weird pattern that we would use a nonparametric smooth for (ex. loess)
- Show a scatterplot; use the plot to 'prescribe' a functional form
- Idea of least squares
- Idea of Normality
- Prediction equation
- Notion of MSE, and maybe relate this back to 's' in the univariate setting

Project 3: Give a data set and have students estimate the parameters of the model, estimate the mean response for a given value of the covariate, get a prediction interval and a confidence interval, and interpret both (see the second sketch at the end of this outline).

PM: The regression zoo: Count data, a nonlinear response, smooth curves
- Show a nonlinear response (maybe a growth-curve data set)
- Show a binary response
- Show count data (horseshoe crab satellite data)
- Show log-normal data (spotfin chub catch per unit effort)

Project 4: Give a binary-response data set, maybe resource selection with presence/absence of wolves (I have a data set that we could use) in grid cells and "road density" as the covariate; have students come up with a prediction equation and brainstorm how to use this equation to produce a resource-selection map. Also give count data and have them come up with a prediction equation (see the third sketch at the end of this outline).

Concept: Correlation vs. causation

Several variables

AM: Correlation revisited: Multivariate data and clustering.
PM: Data collected over time.
Concept: Curse of dimensionality

Geophysical data

AM: Visualizing spatial data and fitting surfaces.
PM: Climate and weather: Data collected over both time and space.
Concept: Uncertainty of a prediction

Wrap up

AM: Project reports and discussion.
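First sketch: a simple summary function and a bootstrap confidence interval for the median, as in the Day 1 PM session and Project 2. The data column passed in is a placeholder, and the percentile bootstrap shown here is just one reasonable choice of interval:

    # A simple function: takes a numeric vector, returns several summary statistics
    col_summary <- function(x) {
      c(mean   = mean(x, na.rm = TRUE),
        median = median(x, na.rm = TRUE),
        sd     = sd(x, na.rm = TRUE))
    }
    # Example use on a (hypothetical) column of the bassfish data:
    # col_summary(bassfish$age)

    # Percentile bootstrap confidence interval for the median
    boot_median_ci <- function(x, B = 2000, level = 0.95) {
      boot_medians <- replicate(B, median(sample(x, replace = TRUE), na.rm = TRUE))
      alpha <- 1 - level
      quantile(boot_medians, probs = c(alpha / 2, 1 - alpha / 2))
    }
    # boot_median_ci(bassfish$age)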
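Second sketch: least-squares fitting with confidence and prediction intervals, as in Project 3. The data frame 'vampire' and its columns 'power' and 'hours' are made-up names standing in for whatever data set is handed out:

    # Simple linear regression of the response on one covariate
    fit <- lm(power ~ hours, data = vampire)
    summary(fit)   # parameter estimates, residual standard error 's', R-squared

    # Mean response at a new covariate value: confidence interval,
    # and a prediction interval for a single new observation
    newdat <- data.frame(hours = 10)
    predict(fit, newdata = newdat, interval = "confidence")
    predict(fit, newdata = newdat, interval = "prediction")

    # A nonparametric smooth for a pattern with no obvious functional form
    # smooth_fit <- loess(power ~ hours, data = vampire)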
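Third sketch: generalized linear models for the binary and count responses behind Project 4. The data frames 'wolves' (columns presence, road_density) and 'crabs' (columns satellites, width) are hypothetical names for the wolf resource-selection and horseshoe crab data:

    # Binary response: logistic regression of wolf presence/absence on road density
    wolf_fit <- glm(presence ~ road_density, data = wolves, family = binomial)
    # Predicted probability of use for each grid cell, the basis for a
    # resource-selection map
    wolves$p_use <- predict(wolf_fit, type = "response")

    # Count response: Poisson regression, e.g., horseshoe crab satellite counts
    crab_fit <- glm(satellites ~ width, data = crabs, family = poisson)
    summary(crab_fit)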