R script file - Boston University

Introduction to R
Data Analysis and Calculations
Katia Oleinik
koleinik@bu.edu
Scientific Computing and Visualization
Boston University
R arithmetic operations
Operation    Description
x + y        addition
x - y        subtraction
x * y        multiplication
x / y        division
x ^ y        exponentiation
x %% y       x mod y (remainder)
x %/% y      integer division
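A minimal sketch of the less familiar operators above. The values of x and y are arbitrary and only serve as an illustration:

```r
x <- 17
y <- 5
x %/% y     # integer division: 3
x %% y      # remainder: 2
x ^ 2       # exponentiation: 289

# The two operators are linked: (x %/% y) * y + (x %% y) gives back x
(x %/% y) * y + (x %% y) == x   # TRUE
```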
Variable Name rules
- Case sensitive: Party ≠ party
- Letters, digits, underscores and dots can be used: DNA.data.2012
- Cannot start with a digit, an underscore, or a dot followed by a digit: 2012.DNA
- Should not use reserved words (if, else, repeat, etc.) or names of built-in functions: which
R atomic constants types:
1. Integer:    n <- 1L or n <- as.integer(1) (a plain n <- 1 is stored as numeric)
2. Numeric:    a <- 2.5
3. Complex:    d <- 3 + 12i
4. Logical:    ans <- TRUE
5. Character:  name <- "Katia" or name <- 'Katia'
6. Special:    NULL, NA, Inf, NaN
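The types listed above can be checked with class(). This is a small sketch using literal values only:

```r
class(1)          # "numeric": a plain 1 is stored as a double
class(1L)         # "integer"
class(3 + 12i)    # "complex"
class(TRUE)       # "logical"
class("Katia")    # "character"

# Special values
is.null(NULL)     # TRUE
is.na(NA)         # TRUE
1 / 0             # Inf
is.nan(0 / 0)     # TRUE: 0/0 yields NaN
```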
R operators:
Operations                    Description
+  -  *  /  %%  ^             Arithmetic
>  >=  <  <=  ==  !=          Relational
!  &  |                       Logical
~                             Model Formulas
<-  ->                        Assignment
$                             List indexing
:                             Sequence
R built-in constants:
Constants     Description
T, F          TRUE, FALSE
LETTERS       26 upper-case letters of the Roman alphabet
letters       26 lower-case letters of the Roman alphabet
month.abb     3-letter abbreviations of month names
month.name    month names
pi            π: ratio of circle circumference to diameter
R math functions for scalars and vectors:
Function                            Description
sin, cos, tan, asin, acos, atan,    Various standard trig, log and exp. functions
atan2, log, log10, log(x,base),
exp, sinh, cosh, …
min(x), max(x), range(x), abs(x)    Minimum/maximum, range and absolute value
sum(x), diff(x), prod(x)            Sum, lagged differences and product of vector elements
mean(x), median(x), sd(x), var(x)   Mean, median, standard deviation, variance
weighted.mean(x,w)                  Mean of x with weights w
quantile(x,probs=)                  Sample quantiles corresponding to the given probabilities
                                    (defaults to 0,.25,.5,.75,1)
round(x, n)                         Rounds the elements of x to n decimals
Re(x), Im(x), Conj(x)               Real part, imaginary part and conjugate of a complex number
Arg(x)                              Angle in radians of the complex number
fft(x)                              Fast Fourier Transform of an array
pmin(x,y,…), pmax(x,y,…)            A vector whose ith element is the min/max of (x[i],y[i],…)
cumsum(x), cumprod(x)               A vector whose ith element is the sum/product of x[1] to x[i]
cummin(x), cummax(x)                A vector whose ith element is the min/max of x[1] to x[i]
var(x,y) or cov(x,y)                Covariance between 2 vectors
cor(x,y)                            Linear correlation between x and y
length(x)                           Get the length of the vector
factorial(n)                        Calculate n!
choose(n,k)                         Combination function: n! / ( k! * (n - k)! )
*Note: Many math functions have a logical parameter na.rm (default FALSE) that controls removal of missing data (NA).
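A short sketch of the na.rm parameter and a few of the cumulative and element-wise functions from the table. The vectors x and y are made-up illustration data:

```r
x <- c(4, NA, 1, 7)
mean(x)                  # NA: missing values propagate by default
mean(x, na.rm = TRUE)    # 4 (mean of 4, 1 and 7)
sum(x, na.rm = TRUE)     # 12

y <- c(3, 1, 4, 1, 5)
cumsum(y)                # 3 4 8 9 14
cummax(y)                # 3 3 4 4 5
pmin(y, 2)               # 2 1 2 1 2 (element-wise minimum against 2)
choose(5, 2)             # 10
```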
Directories and Workspace:
Function                        Description
getwd()                         Get working directory
setwd("/projects/myR/")         Set current directory
ls()                            List objects in the current workspace
rm(x,…)                         Remove objects from the current workspace
list.files()                    List files in the current directory
list.dirs()                     List directories
file.info("myfile.xls")         Get file properties
file.exists("myfile.xls")       Check if file exists
file.remove("myfile.xls")       Delete file
file.append(file1, file2)       Append file2 to file1
file.copy(from, to, …)          Copy file
system("ls -la")                Execute command in the operating system
save.image()                    Save contents of the current workspace in the default file .RData
save.image(file="myR.Rdata")    Save contents of the current workspace in the file
save(a,b, file = "ab.Rdata")    Save a and b in the file
load("myR.Rdata")               Restore workspace from the file
Loading and Saving Data:
Function                                      Description
read.table(file="myData.txt", header=TRUE)    Read text file
read.csv(file="myData.csv")                   Read csv file ("," is the default separator)
list.files(); dir()                           List all files in current directory
file.show("myData.csv")                       Show file content
write.table(x, file="myData.txt", …)          Save data x into a file
write.csv(x, file="myData.csv", …)            Save data x into a csv formatted file
Performance Tip:
- For large data files, specify optional parameters if known:
  read.table(file, nrows=10000, colClasses=c("integer",…), comment.char="")
- When reading matrices, use scan() function instead of read.table()
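The first tip can be sketched as follows. The file is a temporary one created only for this illustration, and the column names id and temp are made up:

```r
# Write a small example file to a temporary location (illustrative data)
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:3, temp = c(61.5, 64.0, 58.2)),
            file = tmp, row.names = FALSE)

# Declaring the row count and column types up front spares R from guessing them
d <- read.table(tmp, header = TRUE, nrows = 3,
                colClasses = c("integer", "numeric"), comment.char = "")
str(d)

unlink(tmp)   # remove the temporary file
```

On a three-row file the gain is negligible; the same arguments matter when the file has many thousands of rows.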
Exploring the data:
Function            Description
class(x)            Get class attribute of an object
names(x)            Function to get or set names of an object
head(x), tail(x)    Returns the first/last parts of a vector, matrix, data frame or function
str(x)              Structure of an object
dimnames(x)         Retrieve or set dimnames of an object
length(x)           Get or set the length of a vector or factor
summary(x)          Generic function: produces summary of the data
attributes(x)       List object's attributes
dim(x)              Retrieve or set the dimension of an object
nrow(x), ncol(x)    Return the number of rows or columns of a matrix or data frame (NULL for a plain vector)
row.names(x)        Retrieve or set the names of the rows
R script file
- An R script is usually saved in a file with extension .R (or .r).
- # serves as a comment indicator (every character on the line after the # sign is ignored).
- source("myScript.R") will load the script into the R workspace and execute it.
- source("myScript.R", echo=TRUE) will load and execute the script and also show the content of the file.
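A minimal sketch of source() in action. The two-line script is written to a temporary file purely for illustration, and the variable name total is made up:

```r
# Create a tiny script file on the fly
script <- tempfile(fileext = ".R")
writeLines(c("# a two-line demo script",
             "total <- sum(1:10)"), script)

source(script)                 # runs the script; 'total' now exists
total                          # 55

source(script, echo = TRUE)    # same, but also prints each expression as it runs
unlink(script)                 # remove the temporary file
```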
R script example (weather.R)
# This script loads data from a table and explores the data
# Script is written for Introduction to R tutorial
# Load datafile
weather <- read.csv("BostonWeather_sept2012.csv")
# Get header names
names(weather)
# Get class of the loaded object
class(weather)
# Get attributes
attributes(weather)
# Get dimensions of the loaded data
dim(weather)
# Get structure of the loaded object
str(weather)
# Summary of the data
summary(weather)
Installing and loading R packages
- To install an R package from the CRAN website: install.packages("package")
- library(package) loads a package into the workspace. A package has to be loaded every time you start a new R session.
- Another way to load a package into the workspace is require(package), usually used inside functions. It returns FALSE and gives a warning (rather than an error) if the package does not exist.
- installed.packages() retrieves details about all installed packages
- library() lists all available packages
- search() lists all loaded packages
- library(help = package) provides information about all the functions in a package
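One common way to combine these calls is sketched below. The package name is only a stand-in: "stats" ships with every R installation, so the install branch is never taken here.

```r
# Substitute the name of the package you actually need
pkg <- "stats"

# require() returns FALSE (with a warning) instead of stopping with an error
if (!require(pkg, character.only = TRUE)) {
  install.packages(pkg)                  # download from CRAN
  library(pkg, character.only = TRUE)    # then load it
}

"package:stats" %in% search()            # TRUE once the package is attached
```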
Getting help
Function                Description                                                    Example
?topic                  Get R documentation on topic                                   ?mean
help(topic)             Get R documentation on topic                                   help(mean)
help.search("topic")    Search the help for topic                                      help.search("mean")
example(topic)          Get example of function usage                                  example(mean)
apropos("topic")        Get the names of all objects in the search list that           apropos("mean")
                        match string "topic"
methods(function)       List all methods of the function                               methods(mean)
function_name           Printing a function name without parentheses in most cases     mean
                        will show its code
R object types:
o Vector – a set of elements of the same type.
o Matrix - a set of elements of the same type organized in rows and columns.
o Data Frame - a set of elements organized in rows and columns, where columns can
be of different types.
o List - a collection of data objects (possibly of different types) – a generalization of a
vector.
Vector creation (examples):
#Create a vector using concatenation of elements: c()
v1 <- c( 5,8,3,9)
v2 <- c( "One", "Two", "Three" )
#Generate sequence (from:to)
s1 <- 2:5
#Sequence function: seq(from, to, by, length.out)
seq(0,1,length.out=5)
[1] 0.00 0.25 0.50 0.75 1.00
seq(1, 6, by = 3)
[1] 1 4
seq(4)
[1] 1 2 3 4
#Generate vector using repeat function: rep(x,times)
rep(7, 3)
[1] 7 7 7
Accessing vector elements:
Indexing vectors    Description
x[n]                nth element
x[-n]               all but nth element
x[1:n]              first n elements
x[-(1:n)]           elements starting from n+1
x[c(1,3,6)]         specific elements
x[x>3 & x<7]        all elements greater than 3 and less than 7
x[x<3 | x>7]        all elements less than 3 or greater than 7
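The indexing forms above, applied to a small made-up vector:

```r
x <- c(2, 9, 4, 6, 1, 8)
x[2]                # 9
x[-2]               # 2 4 6 1 8
x[1:3]              # 2 9 4
x[-(1:3)]           # 6 1 8
x[c(1, 3, 6)]       # 2 4 8
x[x > 3 & x < 7]    # 4 6
x[x < 3 | x > 7]    # 2 9 1 8
```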
Performance Tip:
- R is designed to work with vectors very efficiently: avoid using loops to perform the same operation on each element; rather, apply the function to the whole vector!
- For large arrays avoid dynamic expansion if possible. Allocate memory to hold the result and then fill in the values.
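Both tips can be compared side by side. This is a sketch on random data; the names slow, pre and fast are chosen for illustration only:

```r
n <- 1e4
x <- runif(n)

# Loop with a growing result: the vector is reallocated as it expands
slow <- c()
for (i in 1:n) slow <- c(slow, x[i]^2)

# Loop with preallocation: memory is reserved once, then filled in
pre <- numeric(n)
for (i in 1:n) pre[i] <- x[i]^2

# Vectorized: one operation on the whole vector, no explicit loop
fast <- x^2

identical(slow, pre) && identical(pre, fast)   # TRUE: same result three ways
```

Wrapping each variant in system.time() shows the difference in speed, which grows with n.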
Useful vector operations:
Operation           Description
sort(x)             Returns sorted vector (in increasing order)
rev(x)              Reverses elements of x
which.max(x)        Returns index of the largest element
which.min(x)        Returns index of the smallest element
which(x == a)       Returns vector of indices i, for which x[i]==a
na.omit(x)          Removes the observations with missing data
x[is.na(x)] <- 0    Replaces all missing elements with zeros
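A short sketch of these operations on a made-up vector containing one missing value:

```r
x <- c(5, NA, 8, 3, 8)
sort(x)                  # 3 5 8 8 (NA is dropped by default)
rev(x)                   # 8 3 8 NA 5
which.max(x)             # 3 (first position of the maximum)
which(x == 8)            # 3 5
as.vector(na.omit(x))    # 5 8 3 8

y <- x                   # work on a copy so x stays unchanged
y[is.na(y)] <- 0
y                        # 5 0 8 3 8
```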
Matrix creation (examples):
#Create a matrix using function: matrix(data,nrow,ncol,byrow=F)
matrix(1:6, nrow=2)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
#Create a diagonal matrix: diag( )
diag( 3 )
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
diag( 4, 2, 2 )
     [,1] [,2]
[1,]    4    0
[2,]    0    4
#Combine arguments by column: cbind()
cbind(c(1,2,3), c(4,5,6))
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
#Combine arguments by row: rbind()
rbind(c(1,2,3), c(4,5,6))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
#Create matrix using array(x, dim) function
array(1:6, c(2,3))
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Accessing matrix elements:
Indexing matrices    Description
x[i,j]               Element at row i, column j
x[i,]                Row i (output is a vector)
x[,j]                Column j (output is a vector)
x[c(1,5),]           Rows 1 and 5 (output is a matrix)
x[,c(2,3,6)]         Columns 2, 3 and 6 (output is a matrix)
x["name",]           Row named "name"
x[,"name"]           Column named "name"
Performance Tip:
- When calculating the mean or sum of row/column elements, use the rowSums(), rowMeans(), colSums(), colMeans() functions. They perform faster for matrices than the sum() and mean() functions.
- For large matrices avoid dynamic expansion (using cbind() and rbind()) if possible. Allocate memory to hold the result and then fill in the values.
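A sketch of both tips on a tiny matrix. The names m and res are illustrative:

```r
m <- matrix(1:6, nrow = 2)
rowSums(m)          # 9 12
apply(m, 1, sum)    # same values, computed with a slower R-level loop
colMeans(m)         # 1.5 3.5 5.5

# Preallocate the result matrix instead of growing it with rbind()
res <- matrix(0, nrow = 3, ncol = 2)
for (i in 1:3) res[i, ] <- c(i, i^2)
res[3, ]            # 3 9
```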
Useful matrix operations:
Operation                 Description
t(x)                      Transpose
x * y                     Element-wise multiplication of 2 matrices
x %*% y                   Perform "normal" matrix multiplication
diag(x)                   Returns a vector of diagonal elements
det(x)                    Returns determinant of matrix
solve(x)                  Returns inverse matrix (if it exists), error otherwise
solve(a,b)                Returns solution vector x for the system ax=b
rowSums(), colSums()      Returns vector with a sum of each row/column
rowMeans(), colMeans()    Returns vector with mean values of each row/column
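A small worked example of these operations. The 2 x 2 matrix a and the right-hand side b are made-up values:

```r
a <- matrix(c(2, 1, 1, 3), nrow = 2)   # columns are (2,1) and (1,3)
b <- c(3, 5)

a * a               # element-wise product
a %*% a             # matrix product
det(a)              # 5
x <- solve(a, b)    # solves a %*% x = b, giving 0.8 and 1.4
a %*% x             # reproduces b (up to rounding)
solve(a) %*% a      # identity matrix (up to rounding)
```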
Data frames:
- Elements organized in rows and columns, where columns can be of different types
- All elements in the same column must have the same data type
- Usually obtained by reading a data file
- Can be created using data.frame() function
#Create a data frame using function: data.frame()
name <- c("Paul", "Simon", "Robert")
age <- c(8, 12, 3)
height <- c(53.5, 64.8, 35.2)
family <- data.frame(Name = name, Age = age, Height = height)
family
    Name Age Height
1   Paul   8   53.5
2  Simon  12   64.8
3 Robert   3   35.2
#To sort data frame using one column
family[order(family$Age),]
    Name Age Height
3 Robert   3   35.2
1   Paul   8   53.5
2  Simon  12   64.8
Accessing data frame elements:
Indexing data frames    Description
x[[i]]                  Accessing column i (returns vector)
x[["name"]]             Accessing column named "name" (returns vector)
x$name                  Accessing column named "name" (returns vector)
x[,i]                   Accessing column i (returns vector)
x[j,]                   Accessing row j (returns data frame!)
x[i:j,]                 Accessing rows from i to j
x[i,j]                  Accessing element in row i and column j
x[i, "name"]            Accessing element in row i and column "name"
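The same access patterns, sketched on the family data frame from the earlier example (rebuilt here so the block runs on its own):

```r
family <- data.frame(Name = c("Paul", "Simon", "Robert"),
                     Age = c(8, 12, 3),
                     Height = c(53.5, 64.8, 35.2))

family[[2]]            # 8 12 3 (a vector)
family$Age             # the same column, by name
family[2, ]            # row 2, still a data frame
family[1:2, ]          # rows 1 to 2
family[3, "Height"]    # 35.2

class(family[2, ])     # "data.frame"
class(family[, 2])     # "numeric"
```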
Lists:
- Generalization of vector: ordered collection of components
- Elements can be of any mode or type
- Many R functions return a list as their output object
- Can be created using list() function
#Create a list using function: list()
lst <- list(name="Fred", no.children=3, child.ages=c(12,8,3))
#Create a list using concatenation: c()
list.ABC <- c(list.A, list.B, list.C)
#List can be created from different R objects
list.misc<-list(e1 = c(1,2,3), e2 = list.B, e3 = matrix(1:4,2) )
Accessing list elements:
Indexing lists    Description
x[[i]]            Accessing component i
x[["name"]]       Accessing component named "name"
x$name            Accessing component named "name"
x[i:j]            Accessing components from i to j (returns a list)
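A sketch of the difference between double and single brackets, using the lst example from above (rebuilt here so the block is self-contained):

```r
lst <- list(name = "Fred", no.children = 3, child.ages = c(12, 8, 3))

lst[[3]]           # 12 8 3
lst[["name"]]      # "Fred"
lst$no.children    # 3
lst[1:2]           # a list holding the first two components

class(lst[1])      # "list": single brackets keep the list wrapper
class(lst[[1]])    # "character": double brackets extract the component itself
```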
Factors:
- A factor is stored as a vector of integer codes together with the set of levels (labels) the codes refer to. It provides an easy way to store character strings common for categorical variables.
Performance Tip:
- Use factors to store vectors (especially character vectors) that take only a few values (categorical variables).
- Factors can take less memory and be faster to process than the equivalent character vectors.
Factor operations:
Operation                        Description
factor(x)                        Convert vector to a factor
relevel(x, ref=…)                Reorder the levels of a factor so that ref becomes the first level
levels(x)                        List levels in a factor
attributes(x)                    Inspect attributes of a factor
table(x)                         Get count of elements in each level
is.factor(x)                     Checks if x is a factor. Returns TRUE or FALSE
cut(x, breaks)                   Divide x into intervals (returns a factor)
gl(n,k,length=n*k,labels=1:n)    Generate factors by specifying pattern
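A sketch of the factor operations on a made-up character vector:

```r
x <- c("low", "high", "mid", "low", "low", "high")
f <- factor(x)

levels(f)        # "high" "low" "mid" (alphabetical by default)
as.integer(f)    # 2 1 3 2 2 1: the integer codes stored internally
table(f)         # counts per level: high 2, low 3, mid 1
is.factor(f)     # TRUE

f2 <- relevel(f, ref = "low")
levels(f2)       # "low" "high" "mid"

cut(c(1, 5, 9), breaks = c(0, 3, 6, 10))   # levels (0,3] (3,6] (6,10]
gl(2, 3)                                    # 1 1 1 2 2 2
```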
Regression analysis
Function       Description
lm()           Linear regression
glm()          Generalized linear regression
nls()          Non-linear regression
residuals()    The difference between observed values and fitted values
deviance()     Returns the deviance
gls()          Fit linear model using generalized least squares (package nlme)
gnls()         Fit nonlinear model using generalized least squares (package nlme)
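A minimal linear regression sketch. It uses cars, a small data set that ships with R (columns speed and dist), rather than any data from this tutorial:

```r
# Fit stopping distance as a linear function of speed
fit <- lm(dist ~ speed, data = cars)

coef(fit)              # intercept and slope
head(residuals(fit))   # observed minus fitted values
deviance(fit)          # for lm(), the residual sum of squares

# Predictions for new speeds
predict(fit, newdata = data.frame(speed = c(10, 20)))
```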
Miscellaneous functions for data analysis
Function           Description
optim()            General purpose optimization
nlm()              Minimize a function
spline()           Spline interpolation
kmeans()           k-means clustering on a data matrix
ts()               Create a time series
t.test()           Student's t-test
binom.test()       Binomial test
merge()            Merge 2 data frames
sample()           Sampling
density()          Kernel density estimates of x
logLik(fit)        Computes the logarithm of the likelihood
predict(fit,…)     Predictions from fit based on input data
anova()            Analysis of variance (or deviance)
aov(formula)       Analysis of variance model
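A brief sketch of three of these functions. All data frames, the objective function f and the simulated sample are made up for illustration:

```r
# merge(): join two data frames on their shared column (here "id")
a <- data.frame(id = 1:3, x = c(10, 20, 30))
b <- data.frame(id = c(2, 3, 4), y = c("b", "c", "d"))
merge(a, b)            # keeps the rows with id 2 and 3

# optim(): minimize a function of two parameters, starting from (0, 0)
f <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2
res <- optim(c(0, 0), f)
res$par                # approximately 1 and -2

# t.test(): one-sample test on simulated data
set.seed(1)
t.test(rnorm(30, mean = 0.5))$p.value
```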
Distributions
Function                             Description
rnorm(n, mean=0, sd=1)               Gaussian
runif(n, min=0, max=1)               Uniform
rexp(n, rate=1)                      Exponential
rgamma(n, shape, scale=1)            Gamma
rpois(n, lambda)                     Poisson
rcauchy(n, location=0, scale=1)      Cauchy
rbeta(n, shape1, shape2)             Beta
rchisq(n, df)                        Chi-squared (Pearson)
rbinom(n, size, prob)                Binomial
rgeom(n, prob)                       Geometric
rlogistic(n, location=0, scale=1)    Logistic
rlnorm(n, meanlog=0, sdlog=1)        Lognormal
rt(n, df)                            Student
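A sketch of drawing random samples. The seed and parameter values are arbitrary. The d, p and q companions shown at the end are standard R functions that are not listed in the table above:

```r
set.seed(42)                            # fix the seed so the draws are reproducible
x <- rnorm(1000, mean = 10, sd = 2)
mean(x); sd(x)                          # close to 10 and 2

u <- runif(5, min = 0, max = 1)         # five values between 0 and 1
b <- rbinom(10, size = 20, prob = 0.3)  # ten counts between 0 and 20

# Each family also has density (d), distribution (p) and quantile (q) forms
dnorm(0)                                # 0.3989423
pnorm(1.96)                             # about 0.975
qnorm(0.975)                            # about 1.96
```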