
Stat 579: More Preliminaries, Reading from Files
Ranjan Maitra
2220 Snedecor Hall
Department of Statistics
Iowa State University.
Phone: 515-294-7757
maitra@iastate.edu
September 1, 2011
Some more introductory examples – I
Let us make a vector containing the sequence 1 through
20:
> x <- 1:20
How do we display this object? We simply type its name:
> x
Let us try a simple operation on this object:
> w <- 1 + sqrt(x)/2
This operation takes the element-wise square root of the
vector x, halves it, and adds 1 to each coordinate.
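As a small check of the element-wise behaviour, the same expression can be applied to a shorter vector whose square roots are easy to verify by hand. The names v and w_small are not from the slides; they are introduced here only for illustration.

```r
# Illustrative vector (an assumption, not from the slides) with easy square roots.
v <- c(1, 4, 9, 16)

# sqrt() acts on each element; the division and the addition do too,
# with the scalars 2 and 1 recycled across the whole vector.
w_small <- 1 + sqrt(v) / 2
print(w_small)   # 1.5 2.0 2.5 3.0
```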
Moving on, can we work out what this does?
> dummy <- data.frame(x = x, y = x + rnorm(x)*w)
> dummy
Here we make a “data frame” with two columns, x and y, and
look at it. The column y is x plus normal noise whose standard
deviation is given by w.
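Printing the whole data frame works for 20 rows, but a few inspection functions are often more convenient. The following is a minimal sketch that rebuilds dummy so it can be run on its own; the seed value 579 is an arbitrary assumption, used only to make the random numbers reproducible.

```r
set.seed(579)              # arbitrary seed (an assumption), for reproducibility only
x <- 1:20
w <- 1 + sqrt(x) / 2
dummy <- data.frame(x = x, y = x + rnorm(x) * w)

dim(dummy)      # number of rows and columns
names(dummy)    # column names
head(dummy, 3)  # first three rows
str(dummy)      # compact summary of the column types
```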
Some more introductory examples – II
Consider the following:
> fm <- lm(formula = y ~ x, data=dummy)
> summary(fm)
Call:
lm(formula = y ~ x, data = dummy)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6315 -0.8137  0.2134  0.8470  5.0178

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.63569    0.97234   1.682     0.11
x            0.84072    0.08117  10.358 5.19e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.093 on 18 degrees of freedom
Multiple R-squared: 0.8563, Adjusted R-squared: 0.8483
F-statistic: 107.3 on 1 and 18 DF, p-value: 5.187e-09
We fit a simple linear regression of y on x, store the fitted
model in the object fm, and look at a summary of the results.
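The fitted-model object holds more than the printed summary, and individual pieces can be pulled out with extractor functions. The sketch below rebuilds the data with an arbitrary seed (an assumption), so its numbers will not match the output shown on the slide.

```r
set.seed(579)              # arbitrary seed (an assumption); results differ from the slide
x <- 1:20
w <- 1 + sqrt(x) / 2
dummy <- data.frame(x = x, y = x + rnorm(x) * w)
fm <- lm(y ~ x, data = dummy)

class(fm)                  # "lm": a fitted-model object
coef(fm)                   # estimated intercept and slope
confint(fm, level = 0.95)  # confidence intervals for the coefficients
summary(fm)$r.squared      # multiple R-squared as a single number
```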
Some more introductory examples – III
> attach(dummy)
Make the columns in the data frame visible as variables.
> plot(x = x, y = y)
> abline(a = 0, b = 1, lty = 3) # The true regression line (intercept 0, slope 1).
> abline(coef(fm)) # The fitted simple linear regression line.
> detach()
Remove the data frame from the search path.
> plot(x = fitted(fm), y = resid(fm), xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
A standard regression diagnostic plot to check for
heteroscedasticity. Can you see it?
> rm(fm, x, dummy) # Clean up; y was only a column of dummy, never a workspace object.
> q() # Quit R.
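The residuals-versus-fitted check can also be approximated numerically. Because the noise in y was scaled by w, which grows with x, the residual spread should tend to be larger for larger x. The sketch below is a crude comparison only: the seed and the split at x = 10 are assumptions made for illustration, and with 20 points the pattern may be faint or even reversed in a given sample.

```r
set.seed(579)              # arbitrary seed (an assumption)
x <- 1:20
w <- 1 + sqrt(x) / 2       # noise scale, increasing in x
dummy <- data.frame(x = x, y = x + rnorm(x) * w)
fm <- lm(y ~ x, data = dummy)

# Residual standard deviation in the lower and upper halves of x.
# The cut-off at 10 is an arbitrary choice for illustration.
low  <- sd(resid(fm)[dummy$x <= 10])
high <- sd(resid(fm)[dummy$x > 10])
c(low = low, high = high)

# The true noise scale at the two ends of the range, for reference.
range(w)
```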
Getting help with functions and features
R has an inbuilt help facility similar to the man facility of
UNIX. To get more information on any specific named
function, for example solve, the command is
> help(solve)
An alternative is
> ?solve
For a feature specified by special characters, the argument
must be enclosed in double or single quotes, making it a
character string. This is also necessary for a few words with
syntactic meaning, including if, for and function.
> help("[[")
Either form of quote mark may be used to escape the
other, as in the string "It's important". Our convention is to
use double quote marks for preference.
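A short sketch of both points follows: quoted help topics for special characters and reserved words, and the two kinds of quote marks enclosing each other. The help calls are wrapped in interactive() so the block can also be run as a script; the variable names s1, s2 and s3 are illustrative only.

```r
# Special characters and reserved words must be quoted for help().
if (interactive()) {
  help("[[")
  help("if")
  ?"function"
}

# Either quote type can enclose the other without any escaping.
s1 <- "It's important"
s2 <- 'She said "yes"'

# Inside matching quotes, a backslash escapes the quote mark instead.
s3 <- 'It\'s important'
identical(s1, s3)   # TRUE
```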
Additional Help Features
The help.search command allows searching for help in
various ways: try ?help.search for details and
examples.
The examples on a help topic can normally be run by
> example(topic)
Windows versions of R have other optional help systems:
use
> ?help
for further details.
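The commands on this slide can be tried as follows. This is a sketch only: the search strings are arbitrary examples, and apropos() is an additional base R function, not mentioned on the slide, that lists objects whose names match a pattern.

```r
# Full-text search of installed help pages; ?? is shorthand for help.search().
if (interactive()) {
  help.search("linear models")
  ??regression
}

# apropos() lists objects on the search path whose names match a pattern.
apropos("^read\\.")

# example() runs the Examples section of a help page.
example(mean)
```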
Additional Resources
The R-help mailing list: subscribe to R-help from the
CRAN webpage.
The best way to get help there is to isolate the problem we are
having, create a simple self-contained example
containing the problematic code, and post that.
Do not post questions on the class, homework, etc.! (I monitor the
list.)
The R function RSiteSearch lets us search the archives
of this mailing list.
Online fora: http://cos.name/en/ or our TA’s website:
http://yihui.name/en/
Remember to make use of these resources.
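One possible shape for a simple self-contained example of the kind suggested for the mailing list is sketched below. The data frame toy and the "problem" (a mean that comes out as NA) are hypothetical; the point is only that readers can paste the code and see the same behaviour without needing our files.

```r
# Hypothetical toy data standing in for the real, problematic data set.
toy <- data.frame(g = c("a", "b", "a"), v = c(1.5, NA, 3))

# dput() prints R code that recreates the object, so it can be pasted into a post.
dput(toy)

# The smallest piece of code that shows the behaviour in question:
# here, mean() returns NA because of the missing value.
mean(toy$v)
mean(toy$v, na.rm = TRUE)

# Version and platform details are usually worth including as well.
sessionInfo()

# RSiteSearch() sends a query to an online search site (opens a web browser).
if (interactive()) RSiteSearch("read.table header")
```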
Reading Data from Files
For reading data files, we need to know a few things:
R’s input facilities are fairly simple.
The requirements are fairly strict and rather inflexible.
There is a clear presumption by the designers of R that we
are able to modify input files to satisfy R’s input
requirements. In many cases, this is straightforward using
tools such as file editors, or perl or awk, etc.
If variables are to be held mainly in data frames, an entire
data frame can be read directly with the read.table()
function.
There is also a more primitive input function, scan(), that
can be called directly.
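To make the difference between the two functions concrete, here is a minimal sketch that reads the same small table both ways. The data and the names txt, df, vals and m are made up for illustration, and the text argument is used in place of a file so the example is self-contained.

```r
# Inline text stands in for a file; both functions also accept a file name.
txt <- "height weight
1.60 55
1.75 70
1.82 80"

# read.table() returns a data frame with one column per variable.
df <- read.table(text = txt, header = TRUE)
str(df)

# scan() returns a flat vector; skip = 1 drops the header line.
vals <- scan(text = txt, skip = 1)
vals

# The flat vector can be reshaped by hand if needed.
m <- matrix(vals, ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("height", "weight")))
m
```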
An Example: Housing Data – I
    Price  Floor  Area  Rooms  Age  Cent.heat
01  52.00  111.0   830      5  6.2  no
02  54.75  128.0   710      5  7.5  no
03  57.50  101.0  1000      5  4.2  no
04  57.50  131.0   690      6  8.8  no
05  59.75   93.0   900      5  1.9  yes
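By default numeric items (except row labels) are read as
numeric variables. Non-numeric variables, such as
Cent.heat in the example, were read as factors by default in
versions of R before 4.0.0; from R 4.0.0 onward they are read as
character vectors unless stringsAsFactors = TRUE is specified.
This can be changed if necessary.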
The function read.table() can then be used to read the
data frame directly.
> HousePrice <- read.table(file =
"http://maitra.public.iastate.edu/stat579/houses.dat")
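If the course URL is not reachable, the housing example can be tried on an inline copy of the five rows shown on the slide. This is a sketch under the assumption that the file has exactly that layout; houses_txt and HouseFac are names introduced here for illustration.

```r
# Inline copy of the rows shown on the slide (assumed to match the file layout).
houses_txt <- "Price Floor Area Rooms Age Cent.heat
01 52.00 111.0 830 5 6.2 no
02 54.75 128.0 710 5 7.5 no
03 57.50 101.0 1000 5 4.2 no
04 57.50 131.0 690 6 8.8 no
05 59.75 93.0 900 5 1.9 yes"

# The header line has one fewer field than the data lines, so read.table()
# takes the first line as column names and the first column as row labels.
HousePrice <- read.table(text = houses_txt)
rownames(HousePrice)
str(HousePrice)

# Asking explicitly for factors works the same way in old and new versions of R.
HouseFac <- read.table(text = houses_txt, stringsAsFactors = TRUE)
levels(HouseFac$Cent.heat)
```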
An Example: Housing Data – II
Often we want to leave out the row labels and use the
default labels instead. In this case the file may omit
the row label column.
The data frame may then be read as
> HousePrice <- read.table(file =
"http://maitra.public.iastate.edu/stat579/houses.dat",
header = TRUE)
where the header=TRUE option specifies that the first line
is a line of headings, and hence, by implication from the
form of the file, that no explicit row labels are given.
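The effect of header = TRUE can be seen on an inline copy of the data with the row label column removed. This is a sketch only; houses_nolab and no_header are names introduced here, and the rows are taken from the earlier slide.

```r
# The housing rows without the row label column (names here are illustrative).
houses_nolab <- "Price Floor Area Rooms Age Cent.heat
52.00 111.0 830 5 6.2 no
54.75 128.0 710 5 7.5 no
57.50 101.0 1000 5 4.2 no
57.50 131.0 690 6 8.8 no
59.75 93.0 900 5 1.9 yes"

# With header = TRUE the first line supplies the column names,
# and the rows get the default labels "1", "2", ...
HousePrice <- read.table(text = houses_nolab, header = TRUE)
rownames(HousePrice)
names(HousePrice)

# Without it, the heading line is treated as data and the
# columns get the default names V1, V2, ...
no_header <- read.table(text = houses_nolab)
names(no_header)
nrow(no_header)
```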
Reading from a local file?
> HousePrice <- read.table(file = "houses.dat", header = TRUE)
Here houses.dat is looked up in the current working directory.
On Windows, the full path needs extra care (see next page).
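Reading a local file can be tried without having houses.dat on disk by first writing a small temporary file. The sketch below uses a two-column stand-in; tmp and small are illustrative names.

```r
# Write a small file to a temporary location (a stand-in for houses.dat).
tmp <- tempfile(fileext = ".dat")
writeLines(c("Price Floor", "52.00 111.0", "54.75 128.0"), tmp)

getwd()            # relative names such as "houses.dat" are resolved from here
file.exists(tmp)   # check the path before trying to read it

small <- read.table(file = tmp, header = TRUE)
small

unlink(tmp)        # remove the temporary file again
```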
Reading Local Files on Windows
Get the full path name of the local file. Suppose it is:
C:\Documents and Settings\stat579\houses.dat
Then we use:
> Houses <- read.table(file = "C:\\Documents and Settings\\stat579\\houses.dat", header = TRUE)
Note the extra backslash before each backslash: the backslash is R's
escape character inside strings, so two are needed to produce one
literal backslash. Forward slashes (/) also work in R on Windows.
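The escaping can be checked directly on the path string, without the file having to exist. In this sketch p and p_fwd are illustrative names; nothing is read from disk.

```r
p <- "C:\\Documents and Settings\\stat579\\houses.dat"

# cat() shows the string as R stores it: one backslash per separator.
cat(p, "\n")

# Each "\\" typed in the source is a single character in the string.
nchar("\\")   # 1

# Forward slashes are also accepted by R on Windows and need no escaping.
p_fwd <- gsub("\\", "/", p, fixed = TRUE)
cat(p_fwd, "\n")

# file.path() builds a path from its components using "/".
file.path("C:", "Documents and Settings", "stat579", "houses.dat")
```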
More ways of reading in datafiles will be addressed later.