Stat 579: More Preliminaries, Reading from Files

Ranjan Maitra
2220 Snedecor Hall
Department of Statistics
Iowa State University
Phone: 515-294-7757
maitra@iastate.edu

September 1, 2011

Some more introductory examples – I

Let us make a vector containing the sequence 1 through 20:

> x <- 1:20

How do we display this object? We simply type its name:

> x

Let us try a simple operation on this object:

> w <- 1 + sqrt(x)/2

This operation takes the element-wise square root of the vector x, halves it, and adds 1 to each coordinate.

Moving on, can we work out what this does?

> dummy <- data.frame(x = x, y = x + rnorm(x)*w)
> dummy

We make a "data frame" of two columns, x and y, and look at it. Here y is x plus normal noise whose standard deviation is w.

Some more introductory examples – II

Consider the following:

> fm <- lm(formula = y ~ x, data = dummy)
> summary(fm)

Call:
lm(formula = y ~ x, data = dummy)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6315 -0.8137  0.2134  0.8470  5.0178

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.63569    0.97234   1.682     0.11
x            0.84072    0.08117  10.358 5.19e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.093 on 18 degrees of freedom
Multiple R-squared: 0.8563, Adjusted R-squared: 0.8483
F-statistic: 107.3 on 1 and 18 DF, p-value: 5.187e-09

We fit a simple linear regression of y on x, store the fitted model in the object fm, and look at a summary of the results. Because y is simulated with rnorm(), the numbers will differ from run to run.

Some more introductory examples – III

> attach(dummy)

Make the columns in the data frame visible as variables.

> plot(x = x, y = y)
> abline(a = 0, b = 1, lty = 3)   # The true regression line (intercept 0, slope 1).
> abline(coef(fm))                # The simple linear regression line.
> detach()

Remove the data frame from the search path.

> plot(x = fitted(fm), y = resid(fm), xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")

A standard regression diagnostic plot to check for heteroscedasticity. Can you see it?
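One way to back up the visual check with a number is sketched below. It is only an illustration under stated assumptions: the seed value 579 is arbitrary and not from the slides (it just makes the simulated data repeatable), and the name spread_trend is hypothetical. The sketch computes the correlation between the absolute residuals and the fitted values; a clearly positive value would be consistent with a residual spread that grows with the fitted value, although with only 20 points this is a rough indication and not a formal test.

```r
# Assumption: an arbitrary seed, used only so the simulation is repeatable.
set.seed(579)

x <- 1:20
w <- 1 + sqrt(x) / 2                       # noise standard deviation, increasing in x

# rnorm(length(x)) draws one standard normal value per element of x.
dummy <- data.frame(x = x, y = x + rnorm(length(x)) * w)

# Fit the simple linear regression; fm is a fitted model object of class "lm".
fm <- lm(y ~ x, data = dummy)

# Hypothetical summary: correlation between the size of the residuals
# and the fitted values, as a numeric companion to the residual plot.
spread_trend <- cor(abs(resid(fm)), fitted(fm))
print(spread_trend)
```

Passing data = dummy to lm() means no attach() is needed for the fit itself, which avoids leaving the data frame on the search path by accident.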
> rm(fm, x, dummy)
> q()

Clean up and quit. (y exists only as a column of dummy, so it is removed along with dummy.)

Getting help with functions and features

R has an inbuilt help facility similar to the man facility of UNIX. To get more information on any specific named function, for example solve, the command is

> help(solve)

An alternative is

> ?solve

For a feature specified by special characters, the argument must be enclosed in double or single quotes, making it a character string. This is also necessary for a few words with syntactic meaning, including if, for and function.

> help("[[")

Either form of quote mark may be used to escape the other, as in the string "It's important". Our convention is to use double quote marks for preference.

Additional Help Features

The help.search command allows searching for help in various ways: try ?help.search for details and examples.

The examples on a help topic can normally be run by

> example(topic)

Windows versions of R have other optional help systems: use

> ?help

for further details.

Additional Resources

The R-help mailing list: subscribe to R-help from the CRAN webpage.
The best way to get help there is to isolate the problem we are having, create a simple self-contained example containing the problematic code, and post that. Post no questions on the class, homework, etc.! (I monitor the list.)
The R function RSiteSearch lets us search the archives of this mailing list.
Online fora: http://cos.name/en/ or our TA's website: http://yihui.name/en/
Remember to make use of these resources.

Reading Data from Files

For reading data files, we need to know a few things:

R's input facilities are fairly simple.
The requirements are fairly strict and rather inflexible.
There is a clear presumption by the designers of R that we are able to modify input files to satisfy R's input requirements. In many cases, this is straightforward using tools such as file editors, or perl or awk, etc.
If variables are to be held mainly in data frames, an entire data frame can be read directly with the read.table() function.
There is also a more primitive input function, scan(), that can be called directly.

An Example: Housing Data – I

     Price  Floor  Area  Rooms  Age  Cent.heat
01   52.00  111.0   830      5  6.2         no
02   54.75  128.0   710      5  7.5         no
03   57.50  101.0  1000      5  4.2         no
04   57.50  131.0   690      6  8.8         no
05   59.75   93.0   900      5  1.9        yes

The first line names the six variables; each following line starts with a row label and then gives the six values.

By default numeric items (except row labels) are read as numeric variables and non-numeric variables, such as Cent.heat in the example, as factors. (Since R 4.0.0 non-numeric variables are read as character strings unless stringsAsFactors = TRUE is given.) This can be changed if necessary.

The function read.table() can then be used to read the data frame directly.

> HousePrice <- read.table(file = "http://maitra.public.iastate.edu/stat579/houses.dat")

An Example: Housing Data – II

Often we may want to omit the row labels and use the default labels instead. In this case the file may omit the row label column. The data frame may then be read as

> HousePrice <- read.table(file = "http://maitra.public.iastate.edu/stat579/houses.dat", header = TRUE)

where the header = TRUE option specifies that the first line is a line of headings, and hence, by implication from the form of the file, that no explicit row labels are given.

Reading from a local file?

> HousePrice <- read.table(file = "houses.dat", header = TRUE)

On Windows, the path to the file has to be written differently (see next page).

Reading Local Files on Windows

Get the path name of the local file. Let us say it is:

C:\Documents and Settings\stat579\houses.dat

Then we use:

> Houses <- read.table(file = "C:\\Documents and Settings\\stat579\\houses.dat", header = TRUE)

Note the extra backslash before each backslash. Inside an R character string a single backslash starts an escape sequence, so each literal backslash in the path must be written as \\.

More ways of reading in data files will be addressed later.
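The two file layouts and the Windows path rule can be tried without any file at all. The sketch below is a hedged illustration, not the course's own code: it feeds the five housing rows shown earlier to read.table() through its text argument instead of the houses.dat URL, and the names labelled_lines, unlabelled_lines, by_labels, no_labels, win_path and alt_path are all hypothetical. stringsAsFactors = TRUE is passed explicitly so that Cent.heat becomes a factor in any R version.

```r
# Layout 1: the header line has one fewer field than the data lines,
# so read.table() treats the first column as row labels.
labelled_lines <- c(
  "Price Floor Area Rooms Age Cent.heat",
  "01 52.00 111.0 830 5 6.2 no",
  "02 54.75 128.0 710 5 7.5 no",
  "03 57.50 101.0 1000 5 4.2 no",
  "04 57.50 131.0 690 6 8.8 no",
  "05 59.75 93.0 900 5 1.9 yes"
)
by_labels <- read.table(text = labelled_lines, stringsAsFactors = TRUE)

# Layout 2: no row label column, so the heading line must be declared
# with header = TRUE; rows then get the default labels 1, 2, ...
unlabelled_lines <- c(
  "Price Floor Area Rooms Age Cent.heat",
  "52.00 111.0 830 5 6.2 no",
  "54.75 128.0 710 5 7.5 no",
  "57.50 101.0 1000 5 4.2 no",
  "57.50 131.0 690 6 8.8 no",
  "59.75 93.0 900 5 1.9 yes"
)
no_labels <- read.table(text = unlabelled_lines, header = TRUE,
                        stringsAsFactors = TRUE)

str(by_labels)   # six variables; Cent.heat is a factor here
str(no_labels)

# Windows paths: "\\" in the source is one backslash in the string.
win_path <- "C:\\Documents and Settings\\stat579\\houses.dat"
cat(win_path, "\n")

# Forward slashes are also accepted in file paths by R on Windows.
alt_path <- "C:/Documents and Settings/stat579/houses.dat"
```

Neither path is actually opened here, since the file is only the example location from the slide. Printing win_path with cat() shows the single backslashes that R stores.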