Introduction to R Programming

Introduction
• R is a programming language and a software system for computations and graphics.
• R was originally developed in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand.
• The R language is a "dialect" of the S language, which was developed (mainly) by John Chambers at Bell Laboratories.
• R is open source; its source code is available under the GNU General Public License, meaning that users can modify, copy, and redistribute the software or derivatives, as long as the modified source code is made available.
• The software is regularly updated, but changes are usually not major.

Installation of R
• The R Core Team maintains a network of servers that contains installation files and documentation on R, called the Comprehensive R Archive Network, or CRAN.
• You can access it through http://cran.r-project.org/, or through a Google search for CRAN R.
• R is available for Windows, Mac, and Unix-like operating systems.
• Installation files and instructions can be downloaded from the CRAN site by selecting one of the download links at the top.

R and RStudio
• There are two basic ways to use R on your machine:
  • interactively, through a graphical user interface (GUI) or shell, where R evaluates your code and returns results as you work, or
  • by writing, saving, and then running R script files.
• New users should work with the integrated development environment (IDE) called RStudio.
• The RStudio IDE is available for Windows, Mac OS X, and Linux operating systems.
• It generally makes learning R easier and using R more efficient.
• It is now much more than a script editor, and includes tools for building packages and writing dynamic reports, among others.

Applications of R Programming in the Real World
• Data Science: Programming languages like R give data scientists the power to collect data in real time, perform statistical and predictive analysis, create visualizations, and communicate actionable results to stakeholders.
• Statistical computing: R is one of the most popular programming languages among statisticians. In fact, it was initially built by statisticians for statisticians. It has a rich package repository with more than 9100 packages covering a very wide range of statistical functions. R's expressive syntax allows researchers, even those from non-computer-science backgrounds, to quickly import, clean, and analyze data from various data sources. R also has charting capabilities, which means you can plot your data and create visualizations from any dataset.
• Machine Learning: R is widely used in predictive analytics and machine learning. It has various packages for common ML tasks like linear and non-linear regression, decision trees, linear and non-linear classification, and many more. Everyone from machine learning enthusiasts to researchers uses R to implement machine learning algorithms in fields like finance, genetics research, retail, marketing, and health care.

Working with R session
• We can either type commands directly inside an R session, or save the commands as a script file and execute the whole file inside R.
• To start an R session, type R at the command line in Windows or Linux. For example, from the shell prompt $ in Linux, type:
    $ R
• Once we are inside the R session, we can directly execute R language commands by typing them line by line. Pressing the Enter key ends the command; R evaluates it and shows the > prompt again.
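
The script-file route mentioned above can be sketched as follows. This is only an illustration under assumptions: the temporary file and the names script_path and total are made up for this example, and source() is used as one common way to execute a whole file from inside a running session.

```r
# Hypothetical script file, written to a temporary location for illustration.
script_path <- tempfile(fileext = ".R")
writeLines(c("a <- 5",
             "b <- 6",
             "total <- a + b"),
           script_path)

# source() reads the file and evaluates each command in order,
# much as if the lines had been typed at the > prompt.
source(script_path)

print(total)
# [1] 11
```

From a system shell, a saved script can typically also be run without opening an interactive session, for example with the Rscript command followed by the file name.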

Working with R session
• In the example session below, we declare two variables, a and b, with the values 5 and 6 respectively, and assign their sum to another variable called c:
    > a = 5
    > b = 6
    > c = a + b
    > c
• The value of the variable c is printed as:
    [1] 11
• To get help on any function in R, type help(function-name) at the R prompt. For example, if we need help on the "if" statement, type:
    > help("if")
• The help page for the "if" statement is then displayed.

Comments
• A single-line comment is written using # at the beginning of the statement:
    # My first program in R Programming
• R does not support multi-line comments, but you can get the same effect with the following trick:
    if(FALSE) {
       "This is a demo for multi-line comments and it should be put inside either
        single or double quotes"
    }
    myString <- "Hello, World!"
    print(myString)
• Although the R interpreter still parses the block, the condition is FALSE, so the block is never evaluated and does not interfere with your actual program. Such comments must be enclosed in either single or double quotes.

R Reserved Words
• Reserved words in R programming are a set of words that have special meaning and cannot be used as an identifier (variable name, function name, etc.).
• Here is a list of reserved words in R's parser:
    if, else, repeat, while, function, for, in, next, break,
    TRUE, FALSE, NULL, Inf, NaN, NA,
    NA_integer_, NA_real_, NA_complex_, NA_character_
• Among these words, if, else, repeat, while, function, for, in, next and break are used for conditions, loops and user-defined functions.
• They form the basic building blocks of programming in R.
• TRUE and FALSE are the logical constants in R.
• NULL represents the absence of a value or an undefined value.
• Inf is for "Infinity", for example when 1 is divided by 0,
• whereas NaN is for "Not a Number", for example when 0 is divided by 0.
• NA stands for "Not Available" and is used to represent missing values.
• R is a case-sensitive language, which means that TRUE and True are not the same.

R Variables and Constants
• Rules for writing identifiers in R:
  1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
  2. An identifier must start with a letter or a period. If it starts with a period, the period cannot be followed by a digit.
  3. Reserved words in R cannot be used as identifiers.
• Example:
  • Valid identifiers in R: total, Sum, .fine.with.dot, this_is_acceptable, Number5
  • Invalid identifiers in R: tot@l, 5um, _fine, TRUE, .0ne
• Constants, as the name suggests, are entities whose value cannot be altered. The basic types of constants are numeric constants and character constants.
• Numeric Constants
  • All numbers fall under this category.
  • They can be of type integer, double or complex.
  • The type can be checked with the typeof() function.
  • Numeric constants followed by L are regarded as integer and those followed by i are regarded as complex.
    > typeof(5)
    [1] "double"
    > typeof(5L)
    [1] "integer"
    > typeof(5i)
    [1] "complex"
  • Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.
    > 0xff
    [1] 255
    > 0XF + 1
    [1] 16
• Character Constants
  • Character constants can be represented using either single quotes (') or double quotes (") as delimiters.
    > 'example'
    [1] "example"
    > typeof("5")
    [1] "character"
• Built-in Constants
    > LETTERS
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    [20] "T" "U" "V" "W" "X" "Y" "Z"
    > letters
     [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
    [20] "t" "u" "v" "w" "x" "y" "z"
    > pi
    [1] 3.141593
    > month.name
     [1] "January"   "February"  "March"     "April"     "May"       "June"
     [7] "July"      "August"    "September" "October"   "November"  "December"
    > month.abb
     [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

Example: Hello World Program
    > # We can use the print() function
    > print("Hello World!")
    [1] "Hello World!"
    > # If there is more than one item, we can concatenate using paste()
    > print(paste("How","are","you?"))
    [1] "How are you?"
    > # Quotes can be suppressed in the output
    > print("Hello World!", quote = FALSE)
    [1] Hello World!

R - Data Types
• In any programming language, you use variables to store information.
• Variables are reserved memory locations that store values.
• This means that when you create a variable, you reserve some space in memory.
• You may want to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean, etc.
• In contrast to other programming languages like C and Java, variables in R are not declared with a data type.
• Variables are assigned R objects, and the data type of the R object becomes the data type of the variable.
• There are many types of R objects.
• R is an object-oriented language. Everything in R is an object.
• When R does anything, it creates and manipulates objects.
• R's objects come in different types and flavors:
  • Vectors: These are one-dimensional sequences of elements of the same mode. For example, this could be a vector of length 26 (i.e. one containing 26 elements) where each element is a letter of the alphabet.
  • Matrices & Arrays: These are two-dimensional rectangular objects (matrices) and higher-dimensional rectangular objects (arrays). All elements of a matrix or array have to be of the same mode.
  • Lists: Lists are like vectors, but they do not have to contain elements of the same mode. The first element of a list could be a vector of the 26 letters of the alphabet. The second element could contain a vector of all the prime numbers below 1000. A third could be a 2 by 7 matrix.
  • Data Frames: Data frames are best understood as special matrices (technically they are a type of list). For most applications involving datasets you will use data frames.
  • Factors: Factors are vectors to classify categorical data.
    They behave differently from vectors containing numerical, integer, or character elements.
  • Functions: Functions are objects that take other objects as inputs and return some new object.
• All objects have a certain mode.
• Some objects can only deal with one mode at a time; others can store elements of multiple modes.
• R distinguishes the following modes:
  1. integer: integers (e.g. 1, 2 or -69)
  2. numeric: real numbers (e.g. 2.336, -0.35)
  3. complex: complex or imaginary numbers
  4. character: elements made up of text strings (e.g. "text", "Hello World!", or "123")
  5. logical: data containing logical constants (i.e. TRUE and FALSE)

Vectors
• A vector is simply a sequence of items that are all of the same type.
• To combine items into a vector, use the c() function and separate the items with commas.
• Vectors are the most basic R data objects, and there are six types of atomic vectors.
• They are logical, integer, double, complex, character and raw.
• Single Element Vector: Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector types.
    # Atomic vector of type character.
    print("abc")
    # Atomic vector of type double.
    print(12.5)
    # Atomic vector of type integer.
    print(63L)
    # Atomic vector of type logical.
    print(TRUE)
    # Atomic vector of type complex.
    print(2+3i)
    # Atomic vector of type raw.
    print(charToRaw('hello'))
  Output:
    [1] "abc"
    [1] 12.5
    [1] 63
    [1] TRUE
    [1] 2+3i
    [1] 68 65 6c 6c 6f
• Multiple Elements Vector: To create a vector with numerical values in a sequence, use the : operator.
    # Creating a sequence from 5 to 13.
    v <- 5:13
    print(v)
    # Creating a sequence from 6.6 to 12.6.
    v <- 6.6:12.6
    print(v)
    # If the final element specified does not belong to the sequence then it is discarded.
    v <- 3.8:11.4
    print(v)
  Output:
    [1]  5  6  7  8  9 10 11 12 13
    [1]  6.6  7.6  8.6  9.6 10.6 11.6 12.6
    [1]  3.8  4.8  5.8  6.8  7.8  8.8  9.8 10.8
• Using the seq() function:
    # Create a vector with elements from 5 to 9, incrementing by 0.4.
    print(seq(5, 9, by = 0.4))
    [1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
    # Create a vector with elements from 5 to 9, incrementing by 0.6.
    print(seq(5, 9, by = 0.6))
    [1] 5.0 5.6 6.2 6.8 7.4 8.0 8.6
• Using the c() function: non-character values are coerced to character type if one of the elements is a character.
    # The logical and numeric values are converted to characters.
    s <- c('apple','red',5,TRUE)
    print(s)
  Output:
    [1] "apple" "red"   "5"     "TRUE"

Accessing Vector Elements
• Elements of a vector are accessed using indexing.
• The [ ] brackets are used for indexing. Indexing starts at position 1.
• A negative index drops that element from the result.
• A logical vector of TRUE and FALSE values can also be used for indexing: the elements at TRUE positions are kept.
• An index of 0 selects nothing, so an index made of 0s and 1s returns only the first element, once for each 1.
    # Accessing vector elements using position.
    t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
    u <- t[c(2,3,6)]
    print(u)
    # Accessing vector elements using logical indexing.
    v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
    print(v)
    # Accessing vector elements using negative indexing.
    x <- t[c(-2,-5)]
    print(x)
    # Accessing vector elements using 0/1 indexing.
    y <- t[c(0,0,0,0,0,0,1)]
    print(y)
  Output:
    [1] "Mon" "Tue" "Fri"
    [1] "Sun" "Fri"
    [1] "Sun" "Tue" "Wed" "Fri" "Sat"
    [1] "Sun"

Vector Manipulation
• To find out how many items a vector has, use the length() function:
    > fruits <- c("banana", "apple", "orange")
    > length(fruits)
• When we execute the above code, it produces the following result:
    [1] 3
• Two vectors of the same length can be added, subtracted, multiplied or divided, giving the result as a vector.
    # Create two vectors.
    v1 <- c(3,8,4,5,0,11)
    v2 <- c(4,11,0,8,1,2)
    # Vector addition.
    result <- v1+v2
    print(result)
    # Vector subtraction.
    sub.result <- v1-v2
    print(sub.result)
    # Vector multiplication.
    multi.result <- v1*v2
    print(multi.result)
    # Vector division.
    divi.result <- v1/v2
    print(divi.result)
  Output:
    [1]  7 19  4 13  1 13
    [1] -1 -3  4 -3 -1  9
    [1] 12 88  0 40  0 22
    [1] 0.7500000 0.7272727       Inf 0.6250000 0.0000000 5.5000000

Vector Element Recycling
• If we apply arithmetic operations to two vectors of unequal length, the elements of the shorter vector are recycled to complete the operations.
    v1 <- c(3,8,4,5,0,11)
    v2 <- c(4,11)
    # v2 becomes c(4,11,4,11,4,11)
    add.result <- v1+v2
    print(add.result)
    [1]  7 19  8 16  4 22
    sub.result <- v1-v2
    print(sub.result)
    [1] -1 -3  0 -6 -4  0
• Elements in a vector can be sorted using the sort() function.
    v <- c(3,8,4,5,0,11, -9, 304)
    # Sort the elements of the vector.
    sort.result <- sort(v)
    print(sort.result)
    # Sorting character vectors.
    v <- c("Red","Blue","yellow","violet")
    sort.result <- sort(v)
    print(sort.result)
    # Sort the elements in the reverse order.
    revsort.result <- sort(v, decreasing = TRUE)
    print(revsort.result)
• Change an Item: To change the value of a specific item, refer to its index number.
  Example:
    fruits <- c("banana", "apple", "orange", "mango", "lemon")
    # Change "banana" to "pear"
    fruits[1] <- "pear"
    # Print fruits
    fruits
  Output:
    [1] "pear"   "apple"  "orange" "mango"  "lemon"

Lists
• Lists are R objects that can contain elements of different types, such as numbers, strings, vectors, and even other lists.
• A list can also contain a matrix or a function as its elements.
• A list is created using the list() function.
• The following example creates a list containing strings, numbers, a vector and a logical value.
    # Create a list containing strings, numbers, a vector and a logical value.
    list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
    print(list_data)
    [[1]]
    [1] "Red"
    [[2]]
    [1] "Green"
    [[3]]
    [1] 21 32 11
    [[4]]
    [1] TRUE
    [[5]]
    [1] 51.23
    [[6]]
    [1] 119.1
• The list elements can be given names, and they can be accessed using these names.
    # Create a list containing a vector, a matrix and a list.
    list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
    # Give names to the elements in the list.
    names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
    # Show the list.
    print(list_data)
    $`1st Quarter`
    [1] "Jan" "Feb" "Mar"
    $A_Matrix
         [,1] [,2] [,3]
    [1,]    3    5   -2
    [2,]    9    1    8
    $`A Inner list`
    $`A Inner list`[[1]]
    [1] "green"
    $`A Inner list`[[2]]
    [1] 12.3

Accessing List Elements
• Elements of the list can be accessed by the index of the element in the list.
• In a named list, elements can also be accessed by name.
    # Create a list containing a vector, a matrix and a list.
    list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
    # Give names to the elements in the list.
    names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
    # Access the first element of the list.
    print(list_data[1])
    $`1st Quarter`
    [1] "Jan" "Feb" "Mar"
    # Access the third element. As it is also a list, all its elements will be printed.
    print(list_data[3])
    $`A Inner list`
    $`A Inner list`[[1]]
    [1] "green"
    $`A Inner list`[[2]]
    [1] 12.3
    # Access the list element using the name of the element.
    print(list_data$A_Matrix)
         [,1] [,2] [,3]
    [1,]    3    5   -2
    [2,]    9    1    8

Manipulating List Elements
• We can add, delete and update list elements.
• The examples below add and delete elements at the end of a list, but elements at any position can be added, deleted or updated.
    # Add an element at the end of the list.
    list_data[4] <- "New element"
    print(list_data[4])
    [[1]]
    [1] "New element"
    # Create a list containing a vector, a matrix and a list.
    list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
    # Give names to the elements in the list.
    names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
    # Remove the last element.
    list_data[3] <- NULL
    print(list_data)
    $`1st Quarter`
    [1] "Jan" "Feb" "Mar"
    $A_Matrix
         [,1] [,2] [,3]
    [1,]    3    5   -2
    [2,]    9    1    8
    # Print the 3rd element, which no longer exists.
    print(list_data[3])
    $<NA>
    NULL
    # Create a list containing a vector, a matrix and a list.
    list_data <- list(c("Jan","Feb","Mar"), matrix(c(3,9,5,1,-2,8), nrow = 2), list("green",12.3))
    # Give names to the elements in the list.
    names(list_data) <- c("1st Quarter", "A_Matrix", "A Inner list")
    # Update the 3rd element.
    list_data[3] <- "updated element"
    print(list_data[3])
    $`A Inner list`
    [1] "updated element"

Merging Lists
• You can merge many lists into one list by placing all the lists inside one c() function.
    # Create two lists.
    list1 <- list(1,2,3)
    list2 <- list("Sun","Mon","Tue")
    # Merge the two lists.
    merged.list <- c(list1,list2)
    # Print the merged list.
    print(merged.list)
    [[1]]
    [1] 1
    [[2]]
    [1] 2
    [[3]]
    [1] 3
    [[4]]
    [1] "Sun"
    [[5]]
    [1] "Mon"
    [[6]]
    [1] "Tue"

Converting List to Vector
• A list can be converted to a vector so that the elements of the vector can be used for further manipulation.
• All the arithmetic operations on vectors can be applied after the list is converted into a vector.
• To do this conversion, we use the unlist() function.
• It takes the list as input and produces a vector.
    # Create lists.
    list1 <- list(1:5)
    print(list1)
    [[1]]
    [1] 1 2 3 4 5
    list2 <- list(10:14)
    print(list2)
    [[1]]
    [1] 10 11 12 13 14
    # Convert the lists to vectors.
    v1 <- unlist(list1)
    v2 <- unlist(list2)
    print(v1)
    [1] 1 2 3 4 5
    print(v2)
    [1] 10 11 12 13 14
    # Now add the vectors
    result <- v1+v2
    print(result)
    [1] 11 13 15 17 19

Matrices
• Matrices are R objects in which the elements are arranged in a two-dimensional rectangular layout.
• They contain elements of the same atomic type.
• Matrices containing numeric elements are used in mathematical calculations.
• A matrix can be created with the matrix() function. Specify the nrow and ncol parameters to set the number of rows and columns.
    # Create a matrix
    thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
    # Print the matrix
    thismatrix
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
    thismatrix
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6
• You can also create a matrix with strings:
    mat <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    mat
         [,1]     [,2]
    [1,] "apple"  "cherry"
    [2,] "banana" "orange"

Access Matrix Items
• You can access the items by using [ ] brackets.
• In the example that follows, the first number "1" in the bracket specifies the row position, while the second number "2" specifies the column position.
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    thismatrix[1, 2]
    [1] "cherry"
• The whole row can be accessed if you specify a comma after the number in the bracket:
    > thismatrix
         [,1]     [,2]
    [1,] "apple"  "cherry"
    [2,] "banana" "orange"
    > thismatrix[2,]
    [1] "banana" "orange"
• The whole column can be accessed if you specify a comma before the number in the bracket:
    > thismatrix[,2]
    [1] "cherry" "orange"
• More than one row can be accessed if you use the c() function:
    > thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
    > thismatrix
         [,1]     [,2]        [,3]
    [1,] "apple"  "orange"    "pear"
    [2,] "banana" "grape"     "melon"
    [3,] "cherry" "pineapple" "fig"
    > thismatrix[c(1,2),]
         [,1]     [,2]     [,3]
    [1,] "apple"  "orange" "pear"
    [2,] "banana" "grape"  "melon"
• More than one column can be accessed if you use the c() function:
    > thismatrix[, c(1,2)]
         [,1]     [,2]
    [1,] "apple"  "orange"
    [2,] "banana" "grape"
    [3,] "cherry" "pineapple"
• We can use the cbind() function to add additional columns to a matrix.
• Make sure that the new column has as many elements as the existing matrix has rows.
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
    newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
    # Print the new matrix
    newmatrix
         [,1]     [,2]        [,3]    [,4]
    [1,] "apple"  "orange"    "pear"  "strawberry"
    [2,] "banana" "grape"     "melon" "blueberry"
    [3,] "cherry" "pineapple" "fig"   "raspberry"
• We can use the rbind() function to add additional rows to a matrix.
• Make sure that the new row has as many elements as the existing matrix has columns.
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)
    newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
    # Print the new matrix
    newmatrix
• You can also use the rbind() or cbind() function to combine two or more matrices:
    # Combine matrices
    Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2, ncol = 2)
    Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"), nrow = 2, ncol = 2)
    # Add Matrix2 as new rows
    Matrix_Combined <- rbind(Matrix1, Matrix2)
    Matrix_Combined
    # Add Matrix2 as new columns
    Matrix_Combined <- cbind(Matrix1, Matrix2)
    Matrix_Combined
• We can use negative indices, written with the c() function, to remove rows and columns from a matrix.
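
Removal by negative index can be sketched on a small numeric matrix as follows. The matrix m and the result names are hypothetical, chosen only for this illustration. The sketch also uses the drop = FALSE argument of the [ operator, which is not covered in these slides; it is included because removing all but one row or column otherwise turns the result into a plain vector.

```r
# Hypothetical numeric matrix used only for illustration.
m <- matrix(1:6, nrow = 3, ncol = 2)

# Negative indices drop the listed rows or columns.
without_row1 <- m[-c(1), ]   # rows 2 and 3, both columns: still a matrix
without_col1 <- m[, -c(1)]   # only column 2 remains, so R returns a vector

# drop = FALSE keeps the matrix structure even when one column remains.
kept_matrix <- m[, -c(1), drop = FALSE]

print(without_row1)
print(without_col1)
print(dim(kept_matrix))
```

Whether the simplified vector or the one-column matrix is preferable depends on what the result is passed to next.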
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow = 3, ncol = 2)
    thismatrix
         [,1]     [,2]
    [1,] "apple"  "orange"
    [2,] "banana" "mango"
    [3,] "cherry" "pineapple"
    # Remove the first row and the first column
    thismatrix <- thismatrix[-c(1), -c(1)]
    thismatrix
    [1] "mango"     "pineapple"

Check if an Item Exists
• To find out if a specified item is present in a matrix, use the %in% operator.
  Example: Check if "apple" is present in the matrix:
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    "apple" %in% thismatrix
• Output for the above code will be:
    [1] TRUE
• Use the dim() function to find the number of rows and columns in a matrix:
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    dim(thismatrix)
• Output for the above code will be:
    [1] 2 2
• We can use the length() function to find the total number of elements in a matrix (rows multiplied by columns).
    thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
    length(thismatrix)
• Output will be:
    [1] 4

Arrays
• Arrays are R data objects that can store data in more than two dimensions.
• For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3 columns.
• Arrays can store elements of only one data type.
• An array is created using the array() function.
• It takes vectors as input and uses the values in the dim parameter to create an array.
  Example: Create an array of two matrices, each with 3 rows and 3 columns.
    # Create two vectors of different lengths.
    vector1 <- c(5,9,3)
    vector2 <- c(10,11,12,13,14,15)
    # Take these vectors as input to the array.
    result <- array(c(vector1,vector2), dim = c(3,3,2))
    print(result)
• Here the first and second numbers in dim specify the number of rows and columns.
• The last number in dim specifies how many matrices are created.

Access Array Items
• You can access the array elements by referring to the index position.
• You can use the [] brackets to access the desired elements from an array.
• The syntax is as follows: array[row position, column position, matrix level]
    thisarray <- c(1:24)
    multiarray <- array(thisarray, dim = c(4, 3, 2))
    print(multiarray)
    multiarray[2, 3, 2]
    [1] 22
• You can also access a whole row or column from a matrix in an array, by using the c() function.
    thisarray <- c(1:24)
    multiarray <- array(thisarray, dim = c(4, 3, 2))
    # Access all the items from the first row of matrix one
    multiarray[c(1),,1]
    [1] 1 5 9
    # Access all the items from the first column of matrix one
    multiarray[,c(1),1]
    [1] 1 2 3 4

Data Frames
• A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
• Data frames display data in a table format.
• A data frame can hold different types of data.
• While the first column can be character, the second and third can be numeric or logical.
• However, all values within a single column must be of the same type.
Following are the characteristics of a data frame.
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or character type.
• Each column should contain the same number of data items.
• We can use the data.frame() function to create a data frame.
    # Create a data frame
    Data_Frame <- data.frame(
      Training = c("Strength", "Stamina", "Other"),
      Pulse = c(100, 150, 120),
      Duration = c(60, 30, 45)
    )
    # Print the data frame
    Data_Frame
• The structure of the data frame can be seen by using the str() function.
    > str(Data_Frame)
    'data.frame':   3 obs. of  3 variables:
     $ Training: chr  "Strength" "Stamina" "Other"
     $ Pulse   : num  100 150 120
     $ Duration: num  60 30 45
• We can use the dim() function to find the number of rows and columns in a data frame.
• We can also use the ncol() function to find the number of columns and nrow() to find the number of rows.
    > dim(Data_Frame)
    [1] 3 3
    > ncol(Data_Frame)
    [1] 3
    > nrow(Data_Frame)
    [1] 3

Access Items
• We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data frame.
    Data_Frame[1]
    Data_Frame[["Training"]]
    [1] "Strength" "Stamina"  "Other"
    print(Data_Frame$Pulse)
    [1] 100 150 120
• Extract the 2nd and 3rd rows with the 1st and 2nd columns:
    result <- Data_Frame[c(2,3), c(1,2)]
    print(result)
      Training Pulse
    2  Stamina   150
    3    Other   120

Expand Data Frame
• A data frame can be expanded by adding columns and rows.
• We can use the rbind() function to add new rows to a data frame.
    Data_Frame <- data.frame(
      Training = c("Strength", "Stamina", "Other"),
      Pulse = c(100, 150, 120),
      Duration = c(60, 30, 45)
    )
    # Add a new row. Passing the row as a data frame keeps Pulse and Duration numeric;
    # a plain c("Strength", 110, 110) would turn those columns into character.
    New_row_DF <- rbind(Data_Frame, data.frame(Training = "Strength", Pulse = 110, Duration = 110))
    # Print the data frame with the new row
    New_row_DF
• We can use the cbind() function to add new columns to a data frame.
    # Add a new column
    New_col_DF <- cbind(Data_Frame, Steps = c(1000, 6000, 2000))
    # Print the data frame with the new column
    New_col_DF
      Training Pulse Duration Steps
    1 Strength   100       60  1000
    2  Stamina   150       30  6000
    3    Other   120       45  2000
• We can use negative indices, written with the c() function, to remove rows and columns from a data frame.
    Data_Frame <- data.frame(
      Training = c("Strength", "Stamina", "Other"),
      Pulse = c(100, 150, 120),
      Duration = c(60, 30, 45)
    )
    # Remove the first row and column
    Data_Frame_New <- Data_Frame[-c(1), -c(1)]
    # Print the new data frame
    Data_Frame_New

Factors
• Factors are data objects that are used to categorize data and store it as levels.
• They can store both strings and integers.
• They are useful in columns that have a limited number of unique values.
• They are useful in data analysis for statistical modeling.
Examples of factors are:
• Demography: Male/Female
• Music: Rock, Pop, Classic, Jazz
• Training: Strength, Stamina
• Factors are created using the factor() function by taking a vector as input.
    # Create a factor
    music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
    # Print the factor
    music_genre
    [1] Jazz    Rock    Classic Classic Pop     Jazz    Rock    Jazz
    Levels: Classic Jazz Pop Rock
  We can see from the example above that the factor has four levels (categories): Classic, Jazz, Pop and Rock.
• To print only the levels, use the levels() function:
    levels(music_genre)
    [1] "Classic" "Jazz"    "Pop"     "Rock"
• To access the items in a factor, refer to the index number, using [] brackets. For example, to access the third item:
    music_genre[3]
  Output will be:
    [1] Classic
    Levels: Classic Jazz Pop Rock
• To change the value of a specific item, refer to the index number.
• For example, to change the value of the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
music_genre[3] <- "Pop"
music_genre[3]
Output will be:
[1] Pop
Levels: Classic Jazz Pop Rock
Missing values NA
• Missing data occur when the value of a variable is absent from a data record.
• This can cause serious issues in the data modeling process if not treated properly.
• Above all, most algorithms cannot handle missing data without special treatment.
• There are many ways to handle missing data in R. You can drop those records.
• But keep in mind that you are dropping information when you do so and may lose a potential edge in modelling.
• Some functions do not work with their default settings when there are missing values in the data, and mean() is a classic example of this:
x <- c(1:8, NA)
mean(x)
[1] NA
• In order to calculate the mean of the non-missing values, you need to specify that the NA values are to be removed, using the na.rm=TRUE argument:
mean(x, na.rm = TRUE)
[1] 4.5
• Here is an example where we want to find the locations (7 and 8) of missing values within a vector called vmv:
vmv <- c(1:6, NA, NA, 9:12)
vmv
[1] 1 2 3 4 5 6 NA NA 9 10 11 12
which(is.na(vmv))
[1] 7 8
Data Manipulation Techniques
• In a data analysis process, the data has to be altered, sampled, reduced or elaborated.
• Such actions are called data manipulation.
• The sort() and the order() functions are included in the base package of R and are used to sort or order the data in the desired order.
• The sort() function sorts the elements of a vector or a factor in increasing or decreasing order.
• The syntax of the sort function is:
sort(x, decreasing = FALSE, na.last = NA, ...)
• x is the input vector or factor that has to be sorted.
• decreasing determines whether the result is in decreasing order (TRUE) or in increasing order (FALSE).
• na.last controls the treatment of the NA values present inside the input vector/factor.
• If na.last = TRUE, the NA values are put last. If na.last = FALSE, the NA values are put first. Finally, if it is set to NA, the NA values are removed.
sort(c(3,16,34,77,29,95,24,47,92,64,43), decreasing = FALSE)
[1] 3 16 24 29 34 43 47 64 77 92 95
sort(c(3,16,34,77,29,95,24,47,92,64,43), decreasing = TRUE)
[1] 95 92 77 64 47 43 34 29 24 16 3
sort(c(3,16,34,77,29,95,24,47,92,64,43))
[1] 3 16 24 29 34 43 47 64 77 92 95
sort(c(3,16,34,77,29,95,24,47,92,64,43), na.last = TRUE)
[1] 3 16 24 29 34 43 47 64 77 92 95
sort(c(3,16,34,77,29,95,24,47,92,64,43), na.last = NA)
[1] 3 16 24 29 34 43 47 64 77 92 95
The vector in these examples contains no NA values, so na.last does not change the result.
• The order() function returns the indices of the elements of the input objects in ascending or descending order.
order(..., na.last = TRUE, decreasing = FALSE, method = c("auto", "shell", "radix"))
• ... is a sequence of numeric, character, logical or complex vectors, or a classed R object.
• na.last is the argument that controls the treatment of NA values.
• decreasing controls whether the order of the object will be decreasing or increasing.
• method is a character string that specifies the algorithm to be used. It can take the value "auto", "radix", or "shell".
Ex1:
a <- c(20,40,70,10,50,30,90,60)
a[order(a)]
[1] 10 20 30 40 50 60 70 90
Ex2:
# creates a vector
x <- c(3.5,7.8,5.6,1.1,2.9,4.4)
# orders the data in decreasing fashion
x[order(x, decreasing = TRUE)]
[1] 7.8 5.6 4.4 3.5 2.9 1.1
• The sample() function in R generates a sample of the specified size from a data set or a set of elements, either with or without replacement.
• sample() can be used to sample from a numeric or character vector, and also from the rows of a data frame.
sample(x, size, replace = FALSE, prob = NULL)
• x: data set or a vector of one or more elements from which the sample is to be chosen
• size: size of the sample
• replace: should sampling be with replacement?
• prob: probability weights for obtaining the elements of the vector being sampled
Example: generate a random sample of 10 values from the vector 1 to 20 with replace = TRUE, which means that a value can occur more than once in the sample. The output varies from run to run.
sample(1:20, 10, replace = TRUE)
[1] 10 16 6 13 6 4 6 1 12 6
sample(1:20, 10, replace = TRUE)
[1] 5 15 9 4 20 17 6 11 16 3
Without replacement, the sample cannot be larger than the population:
sample(1:5, 10, replace = FALSE)
Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when 'replace = FALSE'
sample(1:5, 10, replace = TRUE)
[1] 1 2 1 3 3 1 4 5 1 4
Merging/combining datasets in R
• The cbind() function combines two datasets (vectors, matrices or data frames) along their columns.
m1 <- c(1:4)
m2 <- c(5:8)
cbind(m1, m2)
     m1 m2
[1,]  1  5
[2,]  2  6
[3,]  3  7
[4,]  4  8
m1 <- matrix(c(1:4), nrow = 2, ncol = 2)
m2 <- matrix(c(5:8), nrow = 2, ncol = 2)
cbind(m1, m2)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
• The rbind() function combines two datasets (vectors, matrices or data frames) along their rows.
m1 <- matrix(c(1:4), nrow = 2, ncol = 2)
m2 <- matrix(c(5:8), nrow = 2, ncol = 2)
rbind(m1, m2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
[3,]    5    7
[4,]    6    8
• The merge() function performs what is called a join operation in databases.
• This function combines two data frames based on common columns.
m1 <- matrix(1:6, nrow = 2, ncol = 3)
m2 <- matrix(11:16, nrow = 2, ncol = 3)
# col_names avoids shadowing the base function names()
col_names <- c('v1', 'v2', 'v3')
colnames(m1) <- col_names
colnames(m2) <- col_names
merge(m1, m2, by = col_names, all = TRUE)
Usage of various apply functions
• The apply family of functions in R applies a given function repeatedly over a collection of objects (data frame, list, vector, etc.).
• The purpose of apply() is primarily to avoid explicit uses of loop constructs.
• apply() takes a data frame or matrix as input and gives output as a vector, list or array.
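As a rough illustration of the point about avoiding explicit loops, the sketch below computes column sums twice: once with a for loop and once with apply(). The matrix m and the variable names are hypothetical, chosen only for this example.

```r
# Hypothetical 3 x 4 matrix used only for illustration
m <- matrix(1:12, nrow = 3, ncol = 4)

# Explicit loop: sum each column one at a time
col_sums_loop <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  col_sums_loop[j] <- sum(m[, j])
}

# The same computation with apply(); the second argument 2 means "by column"
col_sums_apply <- apply(m, 2, sum)

print(col_sums_loop)
print(col_sums_apply)
```

Both approaches give the same four column sums; the apply() version states the intent in a single call and needs no pre-allocated result vector.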
apply function
• This function takes 3 arguments: apply(X, MARGIN, FUN)
• Here:
-X: an array or matrix
-MARGIN: takes the value 1, 2 or c(1, 2) to define where to apply the function
-MARGIN=1: the manipulation is performed on rows
-MARGIN=2: the manipulation is performed on columns
-MARGIN=c(1,2): the manipulation is performed on rows and columns
-FUN: tells which function to apply. Built-in functions like mean, median, sum, min, max and even user-defined functions can be applied.
m1 <- matrix(1:10, nrow = 5, ncol = 6)
m1
a_m1 <- apply(m1, 2, sum)
a_m1
• The code apply(m1, 2, sum) applies the sum function to the 5×6 matrix and returns the sum of each column.
lapply function
• The lapply() function is useful for performing operations on list objects.
• lapply() returns a list of the same length as the input, each element of which is the result of applying FUN to the corresponding element of the input.
• lapply() takes a list, vector or data frame as input and gives output as a list.
• lapply(X, FUN)
• Arguments:
-X: A vector or an object
-FUN: Function applied to each element of X
movies <- c("SPYDERMAN","BATMAN","VERTIGO","CHINATOWN")
movies_lower <- lapply(movies, tolower)
str(movies_lower)
sapply function
• The sapply() function takes a list, vector or data frame as input and gives output as a vector or matrix.
• It does the same job as the lapply() function, but simplifies the result to a vector or matrix where possible instead of returning a list.
sapply(X, FUN)
• Arguments:
-X: A vector or an object
-FUN: Function applied to each element of X
• Example: We can measure the minimum speed and stopping distance of cars from the cars dataset (a data frame with 50 observations on 2 variables; the data were recorded in the 1920s).
dt <- cars
lmn_cars <- lapply(dt, min)
smn_cars <- sapply(dt, min)
lmn_cars
smn_cars
tapply function
• tapply() computes a measure (mean, median, min, max, etc.) or any other function for each level of a factor variable in a vector.
• It is a very useful function that lets you split a vector into subsets and then apply a function to each subset.
tapply(X, INDEX, FUN = NULL)
• Arguments:
-X: An object, usually a vector
-INDEX: A factor, or a list of factors, of the same length as X
-FUN: Function applied to each subset of X
• To understand how it works, let's use the iris dataset.
• The purpose of this dataset is to predict the class of each of the three flower species: Setosa, Versicolor, Virginica.
• The dataset collects the sepal and petal length and width for each species.
• We can compute the median sepal width for each species using tapply() very quickly.
data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
    setosa versicolor  virginica
       3.4        2.8        3.0
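To make the grouping step easier to follow, here is a minimal sketch on a small hand-made vector. The pulse and training values are hypothetical and only echo the Training example used earlier; the split() plus sapply() version is shown as an assumed equivalent way to arrive at the same group means.

```r
# Hypothetical data: pulse readings and the training type of each reading
pulse <- c(100, 150, 120, 110, 140, 130)
training <- factor(c("Strength", "Stamina", "Other",
                     "Strength", "Stamina", "Other"))

# Mean pulse for each level of the factor
mean_by_training <- tapply(pulse, training, mean)
print(mean_by_training)

# The same grouping done in two steps: split the vector, then apply mean()
mean_by_split <- sapply(split(pulse, training), mean)
print(mean_by_split)
```

The results are named by factor level (Other, Stamina, Strength, in alphabetical order), which is how tapply() labels its output.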