Uploaded by Raji M

ch-1

Introduction to R Programming
Introduction
• R is a programming language and a software system for computations
and graphics.
• R was originally developed in 1992 by Ross Ihaka and Robert
Gentleman at the University of Auckland in New Zealand.
• The R language is a “dialect” of the S language, which was developed
(mainly) by John Chambers at Bell Laboratories.
• R is open source; the source code for R is available under the GNU
General Public License, meaning that users can modify, copy, and
redistribute the software or derivatives, as long as the modified source
code is made available.
• The software is regularly updated, but changes are usually not major.
Installation of R
• The R Core Team maintains a network of servers that contains installation
files and documentation on R, called the Comprehensive R Archive
Network, or CRAN.
• You can access it through http://cran.r-project.org/, or through a Google search for
CRAN R.
• R is available for Windows, Mac, and Unix–like operating systems.
• Installation files and instructions can be downloaded from the CRAN site by
selecting one of the download links at the top.
R and RStudio
• There are two basic ways to use R on your machine:
• interactively, through a graphical user interface (GUI) or shell, where R evaluates your code and returns results as you work, or
• by writing, saving, and then running R script files.
• New users should work with the integrated development environment
(IDE) called RStudio.
• The RStudio IDE is available for Windows, Mac OS X, and Linux operating
systems.
• It generally makes learning R easier and using R more efficient.
• It is now much more than a script editor, and includes tools for building
packages and writing dynamic reports, among others.
Applications of R Programming in Real World
• Data Science: Programming languages like R give data scientists the tools to
collect data in real time, perform statistical and predictive analysis, create
visualizations, and communicate actionable results to stakeholders.
• Statistical computing: R is among the most popular programming languages for statisticians.
In fact, it was initially built by statisticians for statisticians. It has a rich package repository
with more than 9100 packages covering a wide range of statistical functions. R’s
expressive syntax allows researchers, even those from non-computer-science
backgrounds, to quickly import, clean and analyze data from various data sources. R also
has charting capabilities, so you can plot your data and create visualizations
from a dataset.
• Machine Learning: R is widely used in predictive analytics and machine learning.
It has various packages for common ML tasks such as linear and non-linear regression,
decision trees, and linear and non-linear classification. Everyone from
machine learning enthusiasts to researchers uses R to implement machine learning
algorithms in fields like finance, genetics research, retail, marketing and health care.
Working with R session
• We can either type commands one at a time inside an R session, or save the commands in a script file and execute
the whole file inside R.
• To start an R session, type R at the command line in Windows or Linux. For
example, from the shell prompt '$' in Linux, type
$ R
• Once we are inside the R session, we can directly execute R language commands
by typing them line by line. Pressing the Enter key ends the command; R evaluates it
and shows the > prompt again.
• In the example session below, we declare 2 variables 'a' and 'b' to have
values 5 and 6 respectively, and assign their sum to another variable
called 'c':
> a = 5
> b = 6
> c = a + b
> c
• The value of the variable 'c' is printed as:
[1] 11
• To get help on any function of R, type help(function-name) at the
R prompt. For example, if we need help on the "if" statement, type
> help("if")
• The help page for the "if" statement is then displayed.
Comments
• A single-line comment starts with #, as follows:
# My first program in R Programming
• R does not support multi-line comments, but you can work around this with a trick such as the
following:
if(FALSE)
{
"This is a demo for multi-line comments and
it should be put inside either single or double quotes"
}
myString <- "Hello, World!"
print(myString)
The block inside if(FALSE) is parsed by the R interpreter but never evaluated, so it does not interfere with your actual
program. Because it is still parsed, such comments should be put inside either single or double quotes.
R Reserved Words
• Reserved words in R programming are a set of words that have special
meaning and cannot be used as an identifier (variable name, function
name etc.).
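• The reserved words in R's parser are: if, else, repeat, while, function, for, in, next, break,
TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_ and NA_character_.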
• Among these words, if, else, repeat, while, function, for, in, next
and break are used for conditions, loops and user defined
functions.
• They form the basic building blocks of programming in R.
• TRUE and FALSE are the logical constants in R.
• NULL represents the absence of a value or an undefined value.
• Inf is for "Infinity", for example when 1 is divided by 0,
• whereas NaN is for "Not a Number", for example when 0 is divided
by 0.
• NA stands for "Not Available" and is used to represent missing
values.
• R is a case sensitive language, which means that TRUE and True are
not the same.
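The special values above can be tried out directly. The following is a minimal sketch; the variable names are illustrative and are not part of the original slides.

```r
# Special constants described above (variable names are illustrative)
pos_inf  <- 1 / 0   # Inf: a non-zero number divided by zero
not_num  <- 0 / 0   # NaN: zero divided by zero
miss_val <- NA      # NA: a missing value
nothing  <- NULL    # NULL: absence of a value

print(is.infinite(pos_inf))  # TRUE
print(is.nan(not_num))       # TRUE
print(is.na(miss_val))       # TRUE
print(is.na(not_num))        # TRUE: is.na() also reports NaN as missing
print(is.null(nothing))      # TRUE
print(length(nothing))       # 0
```

Each of these values has its own test function (is.infinite(), is.nan(), is.na(), is.null()), which is generally safer than comparing with ==.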
R Variables and Constants
• Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and
underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot
be followed by a digit.
3. Reserved words in R cannot be used as identifiers.
• Example:
• Valid identifiers in R: total, Sum, .fine.with.dot, this_is_acceptable,
Number5
• Invalid identifiers in R: tot@l, 5um, _fine, TRUE, .0ne
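One way to check these rules is with the base function make.names(), which rewrites a string into a syntactically valid name. The helper is_valid_name() below is a hypothetical illustration, not a built-in function: it treats a name as valid if make.names() leaves it unchanged.

```r
# Hypothetical helper: a name is valid if make.names() does not need to change it
is_valid_name <- function(x) x == make.names(x)

valid   <- c("total", "Sum", ".fine.with.dot", "this_is_acceptable", "Number5")
invalid <- c("tot@l", "5um", "_fine", "TRUE", ".0ne")

print(is_valid_name(valid))    # all TRUE
print(is_valid_name(invalid))  # all FALSE
print(make.names(invalid))     # shows how R would repair each invalid name
```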
• Constants, as the name suggests, are entities whose value cannot be
altered. Basic types of constants are numeric constants and character
constants.
• Numeric Constants
• All numbers fall under this category.
• They can be of type integer, double or complex.
• It can be checked with the typeof() function.
• Numeric constants followed by L are regarded as integer and those
followed by i are regarded as complex.
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"
Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.
> 0xff
[1] 255
> 0XF + 1
[1] 16
• Character Constants
• Character constants can be represented using either single quotes (') or
double quotes (") as delimiters.
> 'example'
[1] "example"
> typeof("5")
[1] "character"
• Built-in Constants
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
> month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Example: Hello World Program
> # We can use the print() function
> print("Hello World!")
[1] "Hello World!"
> # If there is more than one item, we can concatenate using paste()
> print(paste("How","are","you?"))
[1] "How are you?"
> # Quotes can be suppressed in the output
> print("Hello World!", quote = FALSE)
[1] Hello World!
R - Data Types
• In any programming language, you need to use various variables
to store various information.
• Variables are nothing but reserved memory locations to store
values.
• This means that, when you create a variable you reserve some
space in memory.
• You may like to store information of various data types like
character, wide character, integer, floating point, double floating
point, Boolean etc.
• In contrast to other programming languages like C and Java, variables in R
are not declared with a data type.
• The variables are assigned with R-Objects and the data type of the
R-object becomes the data type of the variable.
• There are many types of R-objects.
• R is an object-oriented language. Everything in R is an object.
• When R does anything, it creates and manipulates objects.
• R’s objects come in different types and flavors.
• Vectors: These are one-dimensional sequences of elements of the same
mode. For example, this could be vector of length 26 (i.e. one containing
26 elements) where each element is a letter in the alphabet.
• Matrices & Arrays: These are two dimensional rectangular objects
(matrices) and higher dimensional rectangular objects (arrays). All
elements of matrices or arrays have to be of the same mode.
• Lists: Lists are like vectors but they do not have to contain elements of the
same mode. The first element of a list could be a vector of the 26 letters of
the alphabet. The second element could contain a vector of all the prime
numbers below 1000. A third could be a 2 by 7 matrix.
• Data Frames: Data frames are best understood as special matrices
(technically they are a type of list). For most applications involving
datasets you will use data frames.
• Factors: Factors are vectors to classify categorical data. They
behave differently than vectors containing numerical, integer, or
character elements.
• Functions: Functions are objects that take other objects as inputs
and return some new object.
• All objects have a certain mode.
• Some objects can only deal with one mode at a time, others can store elements
of multiple modes.
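• R distinguishes the following basic types (note that mode() reports both integers and
real numbers as "numeric"; typeof() tells them apart as "integer" and "double"):
1. integer: integers (e.g. 1, 2 or -69)
2. numeric: real numbers (e.g. 2.336, -0.35)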
3. complex: complex or imaginary numbers
4. character: elements made up of text-strings (e.g. "text", "Hello World!", or
"123")
5. logical: data containing logical constants (i.e. TRUE and FALSE)
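The list above can be compared with what R itself reports. The sketch below is illustrative only (the names values and info are assumptions); it applies mode(), typeof() and class() to one value of each kind.

```r
# One example value per kind listed above
values <- list(5L, 2.336, 2+3i, "text", TRUE)

# Collect what mode(), typeof() and class() report for each value
info <- data.frame(
  mode   = sapply(values, mode),
  typeof = sapply(values, typeof),
  class  = sapply(values, class),
  stringsAsFactors = FALSE
)
print(info)
```

Here mode() groups integers and reals together as "numeric", while typeof() separates them into "integer" and "double".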
Vectors
• A vector is simply a list of items that are of the same type.
• To combine the list of items to a vector, use the c() function and
separate the items by a comma.
• Vectors are the most basic R data objects and there are six types of
atomic vectors.
• They are logical, integer, double, complex, character and raw.
• Single Element Vector: Even when you write just one value in R, it
becomes a vector of length 1 and belongs to one of the above
vector types.
# Atomic vector of type character.
print("abc");
# Atomic vector of type double.
print(12.5)
# Atomic vector of type integer.
print(63L)
# Atomic vector of type logical.
print(TRUE)
# Atomic vector of type complex.
print(2+3i)
# Atomic vector of type raw.
print(charToRaw('hello'))
Output:
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f
Multiple Elements Vector
• To create a vector with numerical values in a sequence, use the : operator.
# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
Output:
[1] 5 6 7 8 9 10 11 12 13
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Using the sequence function seq()
# Create a vector with elements from 5 to 9 incrementing by 0.4
print(seq(5, 9, by = 0.4))
Output:
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
# Create a vector with elements from 5 to 9 incrementing by 0.6
print(seq(5, 9, by = 0.6))
Output:
[1] 5.0 5.6 6.2 6.8 7.4 8.0 8.6
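Besides a step size, seq() can also be given the number of elements wanted, or a negative step. A short sketch (the variable names s1, s2 and s3 are illustrative):

```r
# Ask for a fixed number of evenly spaced elements instead of a step size
s1 <- seq(5, 9, length.out = 5)
print(s1)   # 5 6 7 8 9

# A negative step produces a decreasing sequence
s2 <- seq(10, 1, by = -3)
print(s2)   # 10 7 4 1

# seq_len(n) gives the integers 1..n
s3 <- seq_len(4)
print(s3)   # 1 2 3 4
```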
• Using the c() function
• The non-character values are coerced to character type if one of the
elements is a character.
• # The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)
Output:
[1] "apple" "red" "5" "TRUE"
• Accessing Vector Elements
• Elements of a vector are accessed using indexing.
• The [ ] brackets are used for indexing. Indexing starts with position 1.
• Giving a negative value in the index drops that element from the result.
• A logical vector (TRUE/FALSE) can also be used for indexing. In a numeric index, 0 selects
nothing, so an index such as c(0,0,0,0,0,0,1) returns only the first element.
• # Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
• # Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
• # Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
• # Accessing vector elements using 0/1 indexing.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
Output:
[1] "Mon" "Tue" "Fri"
[1] "Sun" "Fri"
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
[1] "Sun"
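Logical indexing is most often used with a condition rather than a hand-written TRUE/FALSE vector. The following sketch is an illustration with assumed variable names; it selects elements by condition and finds their positions with which().

```r
v <- c(3, 8, 4, 5, 0, 11)

# The comparison yields a logical vector, which then selects the matching elements
big <- v[v > 4]
print(big)        # 8 5 11

# which() returns the positions where the condition is TRUE
positions <- which(v > 4)
print(positions)  # 2 4 6
```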
Vector Manipulation
• To find out how many items a vector has, use the length() function:
>fruits <- c("banana", "apple", "orange")
>length(fruits)
• When we execute the above code, it produces the following result –
[1] 3
• Two vectors of same length can be added, subtracted,
multiplied or divided giving the result as a vector output.
# Create two vectors.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11,0,8,1,2)
# Vector addition.
result <- v1+v2
print(result)
Output:
[1] 7 19 4 13 1 13
• # Vector subtraction.
sub.result <- v1-v2
print(sub.result)
• # Vector multiplication.
multi.result <- v1*v2
print(multi.result)
• # Vector division.
divi.result <- v1/v2
print(divi.result)
Output:
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
[1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000
Vector Element Recycling
• If we apply arithmetic operations to two vectors of unequal length, then the
elements of the shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
add.result <- v1+v2
print(add.result)
[1] 7 19 8 16 4 22
sub.result <- v1-v2
print(sub.result)
[1] -1 -3 0 -6 -4 0
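When the longer length is not a multiple of the shorter one, R still recycles but also issues a warning. The sketch below is an illustration under that assumption; it catches the warning with withCallingHandlers() so the result can still be inspected.

```r
v1 <- c(1, 2, 3, 4, 5)
v2 <- c(10, 20)   # recycled to 10, 20, 10, 20, 10

# Catch and report the recycling warning, then continue with the result
result <- withCallingHandlers(
  v1 + v2,
  warning = function(w) {
    message("Warning caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(result)  # 11 22 13 24 15
```

The warning is a useful hint that the two vectors may not have been meant to be combined.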
• Elements in a vector can be sorted using
the sort() function.
v <- c(3,8,4,5,0,11, -9, 304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sorting character vectors.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
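A related base function is order(), which returns the positions that would sort a vector rather than the sorted values themselves. A minimal sketch (the names idx and sorted are illustrative):

```r
v <- c(3, 8, 4, 5, 0, 11, -9, 304)

# order() returns the index of the smallest element first, then the next, and so on
idx <- order(v)
print(idx)      # 7 5 1 3 4 2 6 8

# Indexing with those positions gives the same result as sort(v)
sorted <- v[idx]
print(sorted)   # -9 0 3 4 5 8 11 304
```

Because order() returns positions, the same index can be reused to reorder a second, parallel vector in the same way.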
• Change an Item
To change the value of a specific item, refer to the index number:
Example:
fruits <- c("banana", "apple", "orange", "mango", "lemon")
# Change "banana" to "pear"
fruits[1] <- "pear"
# Print fruits
fruits
Output:
[1] "pear" "apple" "orange" "mango" "lemon"
Lists
• Lists are R objects that can contain elements of different types, such as
numbers, strings, vectors, and even other lists.
• A list can also contain a matrix or a function as an element.
• A list is created using the list() function.
• The following example creates a list containing strings, numbers, a vector
and a logical value.
# Create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
• The list elements can be given names and they can be accessed using
these names.
• # Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Show the list.
print(list_data)
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
     [,1] [,2] [,3]
[1,]    3    5   -2
[2,]    9    1    8
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"
$`A Inner list`[[2]]
[1] 12.3
Accessing List Elements
• Elements of the list can be accessed by the index of the element in the
list.
• In case of named lists it can also be accessed using the names.
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Access the first element of the list.
print(list_data[1])
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
$`A Inner list`
$`A Inner list`[[1]]
[1] "green"
$`A Inner list`[[2]]
[1] 12.3
# Access the list element using the name of the element.
print(list_data$A_Matrix)
     [,1] [,2] [,3]
[1,]    3    5   -2
[2,]    9    1    8
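Single brackets return a sub-list, while double brackets return the element itself. The sketch below reuses the list from the examples above; the names sub_list, months and inner_value are illustrative.

```r
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
                  list("green",12.3))
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")

# [ ] keeps the list wrapper: the result is a list of length 1
sub_list <- list_data[1]
print(class(sub_list))   # "list"

# [[ ]] extracts the element itself: here a character vector
months <- list_data[["1st Quarter"]]
print(class(months))     # "character"

# [[ ]] can be chained to reach into a nested list
inner_value <- list_data[["A Inner list"]][[2]]
print(inner_value)       # 12.3
```

Names containing spaces, such as "1st Quarter", need backticks when used with $ (for example list_data$`1st Quarter`).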
Manipulating List Elements
• We can add, delete and update list elements.
• New elements are usually appended at the end of a list.
• Any element can be updated, and any element can be deleted by assigning NULL to it.
• # Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Add element at the end of the list.
list_data[4] <- "New element"
print(list_data[4])
[[1]]
[1] "New element"
• # Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Remove the last element.
list_data[3] <- NULL
print(list_data)
$`1st Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
# After the removal the list has only two elements, so the 3rd element no longer exists.
print(list_data[3])
$<NA>
NULL
# Create a list containing a vector, a matrix and a list.
list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2),
list("green",12.3))
# Give names to the elements in the list.
names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
# Update the 3rd Element.
list_data[3] <- "updated element"
print(list_data[3])
$`A Inner list`
[1] "updated element"
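Elements do not have to be added at the end: the base function append() takes an after argument for inserting at another position. A minimal sketch with an illustrative list called letters_list:

```r
letters_list <- list("a", "b", "d")

# Insert "c" after the 2nd element
letters_list <- append(letters_list, list("c"), after = 2)
print(unlist(letters_list))   # "a" "b" "c" "d"

# Assigning NULL with [[ ]] removes an element at any position, here the first
letters_list[[1]] <- NULL
print(unlist(letters_list))   # "b" "c" "d"
```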
Merging Lists
• You can merge many lists into one list
by placing all the lists inside one c()
function.
# Create two lists.
list1 <- list(1,2,3)
list2 <- list("Sun","Mon","Tue")
# Merge the two lists.
merged.list <- c(list1,list2)
# Print the merged list.
print(merged.list)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
Converting List to Vector
• A list can be converted to a vector so that the elements
of the vector can be used for further manipulation.
• All the arithmetic operations on vectors can be applied
after the list is converted into vectors.
• To do this conversion, we use the unlist() function.
• It takes the list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
[[1]]
[1] 1 2 3 4 5
list2 <-list(10:14)
print(list2)
[[1]]
[1] 10 11 12 13 14
# Convert the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
[1] 1 2 3 4 5
print(v2)
[1] 10 11 12 13 14
# Now add the vectors
result <- v1+v2
print(result)
[1] 11 13 15 17 19
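Because a vector can hold only one type, unlist() coerces mixed elements to a common type, and it builds element names from the list names. A short sketch with illustrative names:

```r
# Mixed types are coerced to character, the most general type present
mixed <- list(1, "a", TRUE)
flat <- unlist(mixed)
print(flat)          # "1" "a" "TRUE"

# Names of list elements are carried over, with a numeric suffix where needed
named <- unlist(list(x = 1:2, y = 3))
print(names(named))  # "x1" "x2" "y"
```

Arithmetic is therefore only meaningful after unlist() when all list elements were numeric to begin with.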
Matrices
• Matrices are R objects in which the elements are
arranged in a two-dimensional rectangular layout.
• They contain elements of the same atomic type.
• Matrices containing numeric elements are used in
mathematical calculations.
• A matrix can be created with the matrix() function. Specify
the nrow and ncol parameters to set the number of rows
and columns.
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
thismatrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
• You can also create a matrix with strings:
mat <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
mat
[,1] [,2]
[1,] "apple" "cherry"
[2,] "banana" "orange"
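The examples above show that matrix() fills column by column by default. The sketch below (variable names are illustrative) shows the byrow argument and the difference between element-wise and matrix multiplication.

```r
# Default: values fill the matrix column by column
by_col <- matrix(1:6, nrow = 2)
# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(by_row[1, ])   # 1 2 3

# * multiplies element by element; %*% is the matrix product
elementwise <- by_col * by_col
product <- by_col %*% t(by_col)   # (2 x 3) times (3 x 2) gives a 2 x 2 matrix
print(product)
```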
Access Matrix Items
• You can access the items by using [ ] brackets.
• The first number "1" in the bracket specifies the row-position, while the
second number "2" specifies the column-position.
thismatrix <- matrix(c("apple", "banana", "cherry",
"orange"), nrow = 2, ncol = 2)
thismatrix[1, 2]
[1] "cherry"
• The whole row can be accessed if you specify a comma after the
number in the bracket:
> thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)
> thismatrix
[,1] [,2]
[1,] "apple" "cherry"
[2,] "banana" "orange"
> thismatrix[2,]
[1] "banana" "orange"
• The whole column can be accessed if you specify a comma before the
number in the bracket:
> thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
> thismatrix[,2]
[1] "cherry" "orange"
• More than one row can be accessed if you use the c() function:
> thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape",
"pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
> thismatrix
[,1] [,2] [,3]
[1,] "apple" "orange" "pear"
[2,] "banana" "grape" "melon"
[3,] "cherry" "pineapple" "fig"
> thismatrix[c(1,2),]
[,1] [,2] [,3]
[1,] "apple" "orange" "pear"
[2,] "banana" "grape" "melon"
• More than one column can be accessed if you use the c()
function:
> thismatrix <- matrix(c("apple", "banana", "cherry",
"orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3,
ncol = 3)
> thismatrix[, c(1,2)]
[,1] [,2]
[1,] "apple" "orange"
[2,] "banana" "grape"
[3,] "cherry" "pineapple"
• We can use the cbind() function to add additional columns to a matrix.
• Ensure that the new column has as many elements as the matrix has rows.
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear",
"melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix
     [,1]     [,2]        [,3]    [,4]
[1,] "apple"  "orange"    "pear"  "strawberry"
[2,] "banana" "grape"     "melon" "blueberry"
[3,] "cherry" "pineapple" "fig"   "raspberry"
• We can use the rbind() function to add additional rows to a matrix.
• Ensure that the new row has as many elements as the matrix has columns.
thismatrix <- matrix(c("apple", "banana", "cherry",
"orange","grape", "pineapple", "pear", "melon", "fig"), nrow = 3,
ncol = 3)
newmatrix <- rbind(thismatrix, c("strawberry", "blueberry",
"raspberry"))
# Print the new matrix
newmatrix
• Again, you can use the rbind() or cbind() function to combine two or more
matrices together:
# Combine matrices
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2, ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"), nrow = 2,
ncol = 2)
# Adding it as rows
Matrix_Combined <- rbind(Matrix1, Matrix2)
Matrix_Combined
# Adding it as columns
Matrix_Combined <- cbind(Matrix1, Matrix2)
Matrix_Combined
• We can use negative indices (here built with the c() function) to remove rows and columns from a matrix.
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango",
"pineapple"), nrow = 3, ncol = 2)
thismatrix
     [,1]     [,2]
[1,] "apple"  "orange"
[2,] "banana" "mango"
[3,] "cherry" "pineapple"
# Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
thismatrix
[1] "mango"     "pineapple"
Check if an Item Exists
• To find out if a specified item is present in a matrix, use the %in% operator:
Example: Check if "apple" is present in the matrix:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
"apple" %in% thismatrix
• Output for above code will be:
[1] TRUE
• Use the dim() function to find the number of rows and columns in a Matrix:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"),
nrow = 2, ncol = 2)
dim(thismatrix)
• Output for above code will be:
[1] 2 2
• We can use the length() function to find the total number of elements in a matrix (rows × columns).
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
length(thismatrix)
Output will be:
[1] 4
Arrays
• Arrays are the R data objects which can store data in more than
two dimensions.
• For example − If we create an array of dimension (2, 3, 4) then it
creates 4 rectangular matrices each with 2 rows and 3 columns.
• An array can store elements of only one data type.
• An array is created using the array() function.
• It takes vectors as input and uses the values in the dim parameter
to create an array.
Example: Create an array of two matrices, each with 3
rows and 3 columns.
# Create two vectors of different lengths.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
# Take these vectors as input to the array.
result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
• Here the first and second numbers in dim = c(3,3,2) specify the
number of rows and columns.
• The last number specifies how many matrices (the size of the third
dimension) we want.
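The array above receives only 9 values for 18 cells, so the input is recycled to fill the second matrix. The following sketch, using the same vectors, checks the dimensions and confirms that both matrices end up identical.

```r
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
result <- array(c(vector1, vector2), dim = c(3, 3, 2))

print(dim(result))    # 3 3 2
print(result[, , 1])  # the first 3 x 3 matrix, filled column by column

# The 9 input values are recycled, so the second matrix repeats the first
same <- identical(result[, , 1], result[, , 2])
print(same)           # TRUE
```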
Access Array Items
• You can access the array elements by referring to the index position.
• You can use the [] brackets to access the desired elements from an array.
• The syntax is as follows: array[row position, column position, matrix level]
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
print(multiarray)
multiarray[2, 3, 2]
[1] 22
• You can also access a whole row or column from a matrix in an array by
leaving the other index empty.
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
# Access all the items from the first row of matrix one
multiarray[c(1),,1]
[1] 1 5 9
# Access all the items from the first column of matrix one
multiarray[,c(1),1]
[1] 1 2 3 4
Data Frames
• A data frame is a table or a two-dimensional array-like structure in
which each column contains values of one variable and each row
contains one set of values from each column.
• The data is displayed in a tabular format.
• Data frames can hold different types of data.
• While the first column can be character, the second and third can
be numeric or logical.
• However, all values within a single column must be of the same type.
Following are the characteristics of a data frame.
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or
character type.
• Each column should contain the same number of data items.
• We can use the data.frame() function to create a data frame.
# Create a data frame
Data_Frame <- data.frame (Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Print the data frame
Data_Frame
• The structure of the data frame can be seen by using str() function.
> str(Data_Frame)
'data.frame': 3 obs. of 3 variables:
$ Training: chr "Strength" "Stamina" "Other"
$ Pulse : num 100 150 120
$ Duration: num 60 30 45
• We can use the dim() function to find the number of rows and columns in a data
frame.
• We can also use the ncol() function to find the number of columns and
nrow() to find the number of rows.
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
> dim(Data_Frame)
[1] 3 3
> ncol(Data_Frame)
[1] 3
> nrow(Data_Frame)
[1] 3
Access Items
• We can use single brackets [ ], double brackets [[ ]] or $ to access
columns from a data frame.
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
[1] "Strength" "Stamina" "Other"
print(Data_Frame$Pulse)
[1] 100 150 120
• Extract 2nd and 3rd row with 1st and 2nd column
result <-Data_Frame[c(2,3),c(1,2)]
print(result)
Training Pulse
2 Stamina 150
3 Other 120
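Rows are often selected by a condition rather than by position. The sketch below is an illustration using the same data frame; the names high_pulse and short_sessions are assumptions.

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Keep only the rows where Pulse is above 110 (all columns)
high_pulse <- Data_Frame[Data_Frame$Pulse > 110, ]
print(high_pulse)

# subset() does the same with a condition and an optional column selection
short_sessions <- subset(Data_Frame, Duration < 50, select = c(Training, Duration))
print(short_sessions)
```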
Expand Data Frame
• A data frame can be expanded by adding columns and rows.
• We can use the rbind() function to add new rows in a Data Frame.
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
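# Add a new row (built as a data frame so that Pulse and Duration stay numeric)
New_row_DF <- rbind(Data_Frame, data.frame(Training = "Strength", Pulse = 110, Duration = 110))
# Print the expanded data frame
New_row_DF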
• We can use the cbind() function to add new columns in a Data Frame.
# Add a new column
New_col_DF <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))
# Print the new column
New_col_DF
  Training Pulse Duration Steps
1 Strength   100       60  1000
2  Stamina   150       30  6000
3    Other   120       45  2000
• We can use negative indices (here built with the c() function) to remove rows and columns from a data frame.
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
# Remove the first row and column
Data_Frame_New <- Data_Frame[-c(1), -c(1)]
# Print the new data frame
Data_Frame_New
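Columns and rows can also be removed by name or by condition instead of by position. A minimal sketch, assuming the same example data frame (the name without_other is illustrative):

```r
Data_Frame <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Remove a column by name by assigning NULL to it
Data_Frame$Duration <- NULL
print(names(Data_Frame))   # "Training" "Pulse"

# Remove rows by condition: keep everything except the "Other" row
without_other <- Data_Frame[Data_Frame$Training != "Other", ]
print(without_other)
```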
Factors
• Factors are the data objects which are used to categorize the
data and store it as levels.
• They can store both strings and integers.
• They are useful in the columns which have a limited number of
unique values.
• They are useful in data analysis for statistical modeling.
Examples of factors are:
• Demography: Male/Female
• Music: Rock, Pop, Classic, Jazz
• Training: Strength, Stamina
• Factors are created using the factor() function by taking a vector as
input.
# Create a factor
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
# Print the factor
music_genre
[1] Jazz Rock Classic Classic Pop Jazz Rock Jazz
Levels: Classic Jazz Pop Rock
We can see from the example above that the factor has four levels
(categories): Classic, Jazz, Pop and Rock.
• To only print the levels, use the levels() function:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic",
"Pop", "Jazz", "Rock", "Jazz"))
levels(music_genre)
[1] "Classic" "Jazz"    "Pop"     "Rock"
Factors
To access the items in a factor, refer to the index number, using []
brackets.
For example, to access the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3]
Output will be:
[1] Classic
Levels: Classic Jazz Pop Rock
Factors
• To change the value of a specific item, refer to the index number.
• For example, to change the value of the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
"Rock", "Jazz"))
music_genre[3] <- "Pop"
music_genre[3]
Output will be:
[1] Pop
Levels: Classic Jazz Pop Rock
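Changing an item works only for values that are already among the factor's levels. The sketch below assumes the music_genre factor from the slides; "Opera" is a made-up value and the names bad and good are hypothetical.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz",
                        "Rock", "Jazz"))

# Assigning a value that is not an existing level gives NA (R also
# issues an "invalid factor level" warning, suppressed here)
bad <- music_genre
suppressWarnings(bad[3] <- "Opera")
is.na(bad[3])              # TRUE

# Adding the level first makes the assignment valid
good <- music_genre
levels(good) <- c(levels(good), "Opera")
good[3] <- "Opera"
as.character(good[3])      # "Opera"

# Count how many items fall into each level of the original factor
genre_counts <- table(music_genre)
print(genre_counts)
```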
Missing values NA
• Missing data occurs when a value is absent for a variable in a data record;
R represents such a value as NA.
• Missing values can cause serious issues in the data modeling process if not
treated properly.
• Many algorithms cannot handle missing data directly.
• There are many ways to handle missing data in R. The simplest is to drop the
affected records.
• But keep in mind that you are dropping information when you do so and may
lose a potential edge in modelling.
• Some functions do not work with their default settings when there are
missing values in the data, and mean() is a classic example of this:
x<-c(1:8,NA)
mean(x)
[1] NA
Missing values NA
• In order to calculate the mean of the non-missing values, you need
to specify that the NA are to be removed, using the na.rm=TRUE
argument:
mean(x, na.rm = TRUE)
[1] 4.5
• Here is an example where we want to find the locations (7 and 8)
of missing values within a vector called vmv:
vmv<-c(1:6,NA,NA,9:12)
vmv
[1] 1 2 3 4 5 6 NA NA 9 10 11 12
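The vmv example stops before the missing values are actually located. One common way to do this, sketched here with the base functions is.na() and which() (the names na_positions, na_count and vmv_clean are hypothetical), is:

```r
vmv <- c(1:6, NA, NA, 9:12)

# is.na() returns TRUE wherever a value is missing
is.na(vmv)

# which() turns those TRUE values into positions
na_positions <- which(is.na(vmv))
na_positions               # 7 8

# Number of missing values
na_count <- sum(is.na(vmv))
na_count                   # 2

# The vector with the missing values dropped
vmv_clean <- vmv[!is.na(vmv)]
```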
Data Manipulation Techniques
• In a data analysis process, the data has to be altered, sampled,
reduced or elaborated.
• Such actions are called data manipulation.
• The sort() and the order() functions are included in the base
package of R and are used to sort or order the data in the
desired order.
Data Manipulation Techniques
• The sort() function sorts the elements of a vector or a factor in
increasing or decreasing order.
• The syntax of the sort function is:
sort(x, decreasing = FALSE, na.last = NA, ...)
• x is the input vector or factor that has to be sorted.
• decreasing determines whether the result is in decreasing order (TRUE) or
increasing order (FALSE).
• na.last controls the treatment of the NA values present inside the
input vector/factor.
• If na.last = TRUE, the NA values are put last.
• If na.last = FALSE, the NA values are put first.
• If na.last = NA (the default), the NA values are removed.
Data Manipulation Techniques
sort(c(3,16,34,77,29,95,24,47,92,64,43), decreasing = FALSE)
[1] 3 16 24 29 34 43 47 64 77 92 95
sort(c(3,16,34,77,29,95,24,47,92,64,43), decreasing = TRUE)
[1] 95 92 77 64 47 43 34 29 24 16 3
sort(c(3,16,34,77,29,95,24,47,92,64,43))
[1] 3 16 24 29 34 43 47 64 77 92 95
sort(c(3,16,34,77,29,95,24,47,92,64,43), na.last=TRUE)
[1] 3 16 24 29 34 43 47 64 77 92 95
sort(c(3,16,34,77,29,95,24,47,92,64,43), na.last=NA)
[1] 3 16 24 29 34 43 47 64 77 92 95
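The vector in the examples above contains no NA values, so na.last has no visible effect there. A small sketch with a made-up vector v that does contain missing values shows the three settings:

```r
# Hypothetical vector with two missing values
v <- c(3, NA, 16, 1, NA, 9)

dropped <- sort(v)                     # default na.last = NA removes the NAs
dropped                                # 1 3 9 16

na_end <- sort(v, na.last = TRUE)      # NAs placed at the end
na_end                                 # 1 3 9 16 NA NA

na_start <- sort(v, na.last = FALSE)   # NAs placed at the beginning
na_start                               # NA NA 1 3 9 16
```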
Data Manipulation Techniques
• The order() function returns the indices that would arrange the elements of
the input objects in ascending or descending order.
order(..., na.last = TRUE, decreasing = FALSE, method = c("auto",
"shell", "radix"))
• ... is a sequence of numeric, character, logical or complex vectors, or
a classed R object.
• na.last is the argument that controls the treatment of NA values.
• decreasing controls whether the order of the object will be decreasing
or increasing.
• method is a character string that specifies the algorithm to be used;
it can take the value "auto", "radix", or "shell".
Data Manipulation Techniques
Ex1:
# creates a vector
a <- c(20,40,70,10,50,30,90,60)
# order(a) gives the positions of the elements in increasing order
a[order(a)]
[1] 10 20 30 40 50 60 70 90
Ex2:
# creates a vector
x <- c(3.5,7.8,5.6,1.1,2.9,4.4)
# orders the data in decreasing fashion
x[order(x, decreasing = TRUE)]
[1] 7.8 5.6 4.4 3.5 2.9 1.1
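Because order() returns positions rather than values, it can also be used to sort the rows of a data frame by one of its columns. The sketch below assumes the Training/Pulse/Duration data frame from the earlier slides; the names df, idx and sorted_df are hypothetical.

```r
df <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Row positions that arrange Pulse from highest to lowest
idx <- order(df$Pulse, decreasing = TRUE)
idx                        # 2 3 1

# Use the positions as a row index to reorder the whole data frame
sorted_df <- df[idx, ]
print(sorted_df)
```

sort() could not be used here, since it returns only the sorted values and loses the link to the other columns.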
Data Manipulation Techniques
• The sample() function in R generates a sample of the specified size from
a data set or a set of elements, either with or without replacement.
• sample() can be used to sample from a numeric or
character vector and, via its row indices, from a data frame.
sample(x, size, replace = FALSE, prob = NULL)
• Arguments:
-x: Data set or a vector of one or more elements from which the sample is to be chosen
-size: Size of the sample
-replace: Should sampling be with replacement?
-prob: Probability weights for obtaining the elements of the vector being sampled
Data Manipulation Techniques
• Example: draw a random sample of 10 values from the vector 1:20
with replace = TRUE.
• This means that a value can occur more than once in the sample.
sample(1:20, 10, replace=TRUE)
[1] 10 16 6 13 6 4 6 1 12 6
sample(1:20, 10, replace=TRUE)
[1] 5 15 9 4 20 17 6 11 16 3
Data Manipulation Techniques
sample(1:5,10,replace=FALSE)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when
'replace = FALSE'
sample(1:5,10,replace=TRUE)
[1] 1 2 1 3 3 1 4 5 1 4
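Since sample() is random, each call gives a different result unless a seed is set. The sketch below is illustrative only: the seed value 42, the 0.8/0.2 weights, and the names s1, s2, row_ids, cars_sample and coin are arbitrary choices, and cars is R's built-in data set.

```r
# Setting the same seed before each call makes the draw reproducible
set.seed(42)
s1 <- sample(1:20, 5)
set.seed(42)
s2 <- sample(1:20, 5)
identical(s1, s2)          # TRUE

# Sampling rows of a data frame: sample the row numbers, then index
row_ids <- sample(nrow(cars), 10)
cars_sample <- cars[row_ids, ]
nrow(cars_sample)          # 10

# Weighted sampling with prob: "H" is drawn about 80% of the time
coin <- sample(c("H", "T"), 100, replace = TRUE, prob = c(0.8, 0.2))
table(coin)
```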
Merging/combining datasets in R
• The cbind() function combines two datasets (or data frames) along
their columns.
m1<-c(1:4)
m2<-c(5:8)
cbind(m1,m2)
     m1 m2
[1,]  1  5
[2,]  2  6
[3,]  3  7
[4,]  4  8
m1<-matrix(c(1:4),nrow=2,ncol=2)
m2<-matrix(c(5:8),nrow=2,ncol=2)
cbind(m1,m2)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
Merging/combining datasets in R
• The rbind() function combines two data frames along their
rows.
m1<-matrix(c(1:4),nrow=2,ncol=2)
m2<-matrix(c(5:8),nrow=2,ncol=2)
rbind(m1,m2)
[,1] [,2]
[1,] 1 3
[2,] 2 4
[3,] 5 7
[4,] 6 8
Merging/combining datasets in R
• The merge() function performs what is called a join operation in databases.
• This function combines two data frames based on common columns.
m1<-matrix(1:6,nrow=2,ncol=3)
m2<-matrix(11:16, nrow=2,ncol=3)
names <- c('v1','v2','v3')
colnames(m1) <- names
colnames(m2) <- names
merge(m1,m2, by = names, all = TRUE)
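A join is usually easier to follow when the two data frames share a single key column. The following sketch uses two made-up data frames, students and marks, with a hypothetical key column id; the all and all.x arguments are standard arguments of merge().

```r
# Hypothetical example data: the two tables share the key column "id"
students <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ben", "Chen"))
marks <- data.frame(id = c(2, 3, 4), score = c(85, 72, 90))

# Inner join: only ids present in both data frames (2 and 3)
inner <- merge(students, marks, by = "id")

# Full outer join: all ids from both sides, NA where a value is missing
full <- merge(students, marks, by = "id", all = TRUE)

# Left join: every row of students, with score where available
left <- merge(students, marks, by = "id", all.x = TRUE)

print(inner)
print(full)
print(left)
```

With all = TRUE, as in the matrix example above, rows that have no match in the other data set are kept rather than dropped.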
Usage of various apply functions
• The apply() function in R can be fed with many functions to
perform repetitive operations on a collection of objects
(data frame, list, vector, etc.).
• The purpose of apply() is primarily to avoid explicit uses of
loop constructs.
• apply() takes a data frame or matrix as input and gives
output as a vector, list or array.
apply function
• This function takes 3 arguments:
apply(X, MARGIN, FUN)
• Here:
-X: an array or matrix
-MARGIN: takes the value 1, 2 or c(1,2) to
define where to apply the function
-MARGIN=1: the manipulation is performed on rows
-MARGIN=2: the manipulation is performed on columns
-MARGIN=c(1,2): the manipulation is performed on rows and columns
-FUN: tells which function to apply. Built-in functions like
mean, median, sum, min, max and even user-defined
functions can be applied.
apply function
m1 <- matrix(1:10, nrow = 5, ncol = 6)  # the values 1:10 are recycled to fill all 30 cells
m1
a_m1 <- apply(m1, 2, sum)
a_m1
• The code apply(m1, 2, sum) applies the sum function to each column of
the 5×6 matrix and returns a vector of the six column sums.
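To make the effect of MARGIN concrete, the sketch below first reruns the 5×6 example and then uses a smaller, made-up 2×3 matrix m with each MARGIN setting. The anonymous functions are illustrative, not part of the original example.

```r
# The 5x6 matrix from the example: 1:10 is recycled column by column
m1 <- matrix(1:10, nrow = 5, ncol = 6)
col_sums_m1 <- apply(m1, 2, sum)
col_sums_m1                # 15 40 15 40 15 40

# A smaller matrix: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(m, 1, sum)    # MARGIN = 1: one result per row
row_sums                        # 9 12

col_sums <- apply(m, 2, sum)    # MARGIN = 2: one result per column
col_sums                        # 3 7 11

# MARGIN = c(1, 2): the function is applied to every single cell
scaled <- apply(m, c(1, 2), function(v) v * 10)

# A user-defined function: the range width of each column
col_spread <- apply(m, 2, function(col) max(col) - min(col))
col_spread                      # 1 1 1
```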
lapply function
• The lapply() function is useful for performing operations on list objects;
each element of its result is obtained by applying FUN to the corresponding
element of the input.
• lapply() takes a list, vector or data frame as input and always returns
a list of the same length as the input.
• lapply(X, FUN)
• Arguments:
-X: A vector or an object
-FUN: Function applied to each element of X
lapply function
movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <-lapply(movies, tolower)
str(movies_lower)
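The result of lapply() is always a list, even when the input is a plain vector. A short sketch using the movies vector from the slide shows this, and how unlist() can flatten the result; the names movies_vec and name_lengths and the anonymous function are hypothetical additions.

```r
movies <- c("SPYDERMAN", "BATMAN", "VERTIGO", "CHINATOWN")
movies_lower <- lapply(movies, tolower)

is.list(movies_lower)      # TRUE
length(movies_lower)       # 4, same length as the input

# unlist() converts the list back into a character vector
movies_vec <- unlist(movies_lower)
movies_vec

# lapply() with a user-defined function: number of characters per title
name_lengths <- lapply(movies, function(m) nchar(m))
unlist(name_lengths)       # 9 6 7 9
```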
sapply function
• The sapply() function takes a list, vector or data frame as input and gives
output as a vector or matrix.
• It does the same job as the lapply() function, but simplifies the
result to a vector or matrix where possible instead of returning a list.
sapply(X, FUN)
• Arguments:
-X: A vector or an object
-FUN: Function applied to each element of X
sapply function
• Example: We can find the minimum speed and stopping
distance of cars from the cars dataset (the data were recorded in
the 1920s; it is a data frame with 50 observations on 2
variables).
dt <- cars
lmn_cars <- lapply(dt, min)
smn_cars <- sapply(dt, min)
lmn_cars
smn_cars
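The difference between the two results is in their structure rather than their values. The sketch below reruns the cars example and inspects both objects; the extra sapply(dt, range) call and the name rng_cars are illustrative additions.

```r
dt <- cars                     # built-in data set: columns speed and dist

lmn_cars <- lapply(dt, min)    # a list with elements $speed and $dist
smn_cars <- sapply(dt, min)    # a named numeric vector

is.list(lmn_cars)              # TRUE
is.vector(smn_cars) && !is.list(smn_cars)   # TRUE
smn_cars                       # speed 4, dist 2

# When FUN returns more than one value, sapply() builds a matrix:
# one column per variable, one row per returned value (min, max)
rng_cars <- sapply(dt, range)
rng_cars
```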
tapply function
• tapply() computes a measure (mean, median, min, max, etc.) or a
function for each level of a factor variable in a vector.
• It is a very useful function that lets you split a vector into subsets
and then apply a function to each subset.
tapply(X, INDEX, FUN = NULL)
• Arguments:
-X: An object, usually a vector
-INDEX: A factor (or list of factors) of the same length as X
-FUN: Function applied to each group of X
tapply function
• To understand how it works, let’s use the iris dataset.
• This dataset is commonly used to predict which of three
flower species a sample belongs to: Setosa, Versicolor, Virginica.
• The dataset records, for each flower, the length and width of its
sepals and petals.
• We can compute the median sepal width for each species using
tapply() very quickly.
data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
    setosa versicolor  virginica
       3.4        2.8        3.0
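The same pattern works for any column and any summary function. As a further sketch on the built-in iris data (the names mean_len and n_per_species are hypothetical), the mean sepal length and the number of flowers per species can be computed as follows:

```r
data(iris)

# Mean sepal length for each species
mean_len <- tapply(iris$Sepal.Length, iris$Species, mean)
mean_len                   # setosa 5.006, versicolor 5.936, virginica 6.588

# Number of observations in each group
n_per_species <- tapply(iris$Sepal.Length, iris$Species, length)
n_per_species              # 50 flowers per species
```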