Module overview
Julien Gagneur, Tidy data and combining tables

Introduction to R, RStudio and R markdown

RStudio
RStudio is software for programming in R and analyzing data interactively with R.
It organizes the R session into 4 panels.

Julien Gagneur, Lecture 1 - R Basics

R markdown
• R markdown allows us to combine R commands with natural text.
• Create an R markdown file: File -> New file -> R markdown
• Use Ctrl+Shift+K to start the knit process, or press the Knit button.
• Each R markdown document contains a YAML header defining document-wide settings such as title, author and output type (html, pdf, doc, ...).
• R markdown cheatsheet [https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf]

Installing and loading packages
Packages are the fundamental units of reproducible R code. Several packages are automatically included when installing R. We can install and load new packages by typing:
    install.packages("vegan")  # install new package called vegan, do only once!
    library(vegan)             # and load it (in every script)
vegan is a package to analyze biodiversity. To learn more about an installed package try:
    browseVignettes("vegan")

Topic 1: R Basics
• Lecture, central exercise and tutorial

First steps with R
• Assignments: we use <- to assign values to variables.
• Objects: we use the term variable or object to describe stuff that is stored in R. To see the value stored in a variable, type its name or use print().
• Functions: perform tasks in R. They take inputs, called arguments, and return outputs. A function's arguments can be specified explicitly or left at their default values. Example: sqrt() computes the square root. Use help("sqrt") or ?sqrt to see the manual of a function.
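The notes on assignment and function arguments can be tried out directly in the console. The following is a minimal sketch; the variable names (side_length, root, ...) are made up for illustration and do not come from the lecture.

```r
# Assign a value to a variable with <- (the name is hypothetical)
side_length <- 16

# Typing the name or calling print() shows the stored value
print(side_length)

# sqrt() takes one argument and returns its square root
root <- sqrt(side_length)

# log() has a second argument, base, with default value exp(1)
natural_log <- log(8)            # base left at its default
log_base_2  <- log(8, base = 2)  # base specified explicitly

# args() gives a quick look at the arguments of a function
args(log)

# ls() lists the variables currently saved in the workspace
ls()
```

Leaving out base gives the natural logarithm, which is why the two calls to log return different values.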
• ls(): see all the variables saved in my workspace.
• help() and ?: find out what a function expects and what it does.
• Other prebuilt objects: data() shows all the available datasets.
• Variable names: by convention use lower case, meaningful words that describe what is stored, and underscores as a substitute for spaces.
• Reusing scripts: a script allows us to redefine the variables and recompute the solution.
• Commenting your code: text after # is not evaluated. Use this to write reminders of why we wrote particular code.

Data types
• Variables can be of different types. class() helps determine what type of object we have: numbers, characters, tables, ...

Data frames
• A data frame is a table with rows (representing observations) and columns (representing the different variables reported for each observation).
• Data frames are useful because we can combine different data types into one object.
• Loading the murders data set:
    library(dslabs)
    data(murders)
• Find out more about the structure of an object with the function str:
    str(murders)
• Get the names of the columns with the function names:
    names(murders)
• Show the first lines of a data frame. head shows the first six lines by default:
    head(murders)
    head(murders, n = 3)
    murders[1:3, ]
• Access a variable in a data frame with the accessor operator $. The variables are the columns included in the dataset:
    murders$population
• Access single rows and columns of a data frame with square brackets [ ].

Vectors
• Vectors are objects with several entries.
    pop <- murders$population
    length(pop)   # tells us how many entries are in the vector
• Creating vectors: use the function c, which stands for combine or concatenate. A numeric vector holds numbers.
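The data frame functions listed above can be exercised without installing dslabs. The sketch below uses a small made-up table (cities and all of its values are hypothetical stand-ins, not the murders data) so that it runs in a fresh R session.

```r
# A small made-up table standing in for a real dataset (illustrative values only)
cities <- data.frame(
  city       = c("a", "b", "c", "d"),
  population = c(100, 250, 75, 400),
  coastal    = c(TRUE, FALSE, FALSE, TRUE)
)

class(cities)        # type of the object
str(cities)          # structure: one line per column with its type
names(cities)        # column names
head(cities, n = 3)  # first three rows
cities[1:3, ]        # the same rows selected with square brackets

# The accessor $ extracts one column as a vector
pop <- cities$population
length(pop)          # number of entries in the vector
class(pop)           # type of the extracted column
```

The columns hold characters, numbers and logicals side by side, which is the property that makes data frames convenient for storing datasets.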
Example 2: characters
• We use quotes to denote that the entries are characters rather than variable names.
    country <- c("italy", "canada", "egypt")
• We can also use single quotes ' '.
• Logical vectors hold the values TRUE and FALSE. For example, 3 != 3 is FALSE.

Creating sequences
• Create a sequence using seq(): the first argument is the start, the second argument is the end, and the third is how much to jump by. The default increment is 1.
• The sequence 6 7 8 9 can be created in 2 different ways: seq(6, 9) or 6:9.

Accessing vectors
• Use square brackets [ ] to access specific elements of a vector, e.g. the second element.
• Access more than one entry by using a multi-entry vector as an index.

Functions to apply on a numerical vector

Sorting
• sort: sorts a vector from least to most. On its own this is often not enough, because it does not give us enough information (for example, which entry each value belongs to).
• order: takes a vector as input and returns the vector of the indexes that sort the input vector.
    index <- order(x)
    x[index]

Naming the entries of a vector
• Use the names function to connect the codes with the country names.

Other functions to apply on numerical vectors
max and which.max
    v
    ## [1] 8 3 4
    # maximal value
    max(v)
    ## [1] 8
    # index containing the maximal value
    which.max(v)
    ## [1] 1
try also min and which.min()

• which, match and %in%: look up entries, e.g. in murders$state.

Factors
• Factors are useful for storing categorical data. murders$region is a factor.
• Inspect the categories (levels) of a factor by using the function levels().
• By default the levels are sorted in alphanumerical order.

Reorder a factor
The default in R is for the levels to follow alphabetical order.
However, often we want the levels to follow a different order. You can specify an order through the levels argument when creating the factor with the factor function. For example, in the murders dataset regions are ordered from east to west. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. To do this, the values are associated with each level.

Data Types in R
Not available (NA)
There exists a special value called NA ("not available") to handle missing data, among other scenarios. We encounter NAs often, as missing data is a very common problem in real-world datasets. For example, in the following, there is no temperature measurement on day 2.
    ##   day temperature
    ## 1  d1          28
    ## 2  d2          NA
    ## 3  d3          31
Another example of NA is when a function tries to coerce one type to another and the conversion is not possible. For example:
    x <- c("1", "y", "3")
    as.numeric(x)
    ## Warning: NAs introduced by coercion
    ## [1]  1 NA  3
R cannot guess what number "y" should be, so it is not able to coerce it. More info in the script!

Further data types
Other data types in R include:
• lists, which contain different types of data
• matrices, for two-dimensional data
See the script for more information about them!

Lists
• Lists are useful because they can store any combination of different types.
• Data frames are a special case of lists.
• Extract the components of a list with the accessor $, as in record$student_id.
• Alternatively, use double square brackets with the variable name: record[["student_id"]].
• In a list without variable names, components are selected with double square brackets and a number, as in record2[[1]].
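The sorting, factor and NA notes above can be combined in one short session. This is only a sketch: the numbers and the region labels are made up, and sum is chosen arbitrarily as the summary passed to reorder.

```r
# Made-up values for illustration only
x <- c(31, 4, 15, 92, 60)

sort(x)             # values in increasing order
index <- order(x)   # positions that would sort x
x[index]            # same result as sort(x)
rank(x)             # rank of each entry within x
which.max(x)        # position of the largest value

# A factor with hypothetical categories; levels default to alphabetical order
region <- factor(c("west", "east", "south", "east", "west"))
levels(region)

# reorder() re-levels the factor by a summary of a numeric vector,
# here the sum of x within each level
region_by_total <- reorder(region, x, FUN = sum)
levels(region_by_total)

# Coercion that cannot succeed yields NA (the warning is silenced here)
y <- suppressWarnings(as.numeric(c("1", "y", "3")))
is.na(y)               # flags the missing entry
mean(y, na.rm = TRUE)  # summary that ignores the missing entry
```

order is usually the more useful of the two sorting functions, because its result can index a second vector (for example names) in the same sorted order.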
• The components of a list can be of different types, for example the character "John" and the number 1234.

Matrices
Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type. For this reason data frames are much more useful for storing data, since we can have characters, factors, and numbers in them. Yet matrices have a major advantage over data frames: we can perform matrix algebra operations, a powerful type of mathematical technique.
    mat <- matrix(1:12, nrow = 4, ncol = 3)
• Access specific entries using square brackets: mat[2, 3] is the entry in the 2nd row and 3rd column (10).
• Entire 2nd row: mat[2, ], leaving the column spot empty.
• Entire 3rd column: mat[, 3], leaving the row spot empty.
• More than one column: mat[, 2:3].
• Rows and columns can be subset at the same time.
• Converting a matrix into a data frame: as.data.frame(mat).

Coercion
• Coercion is R's attempt to be flexible with data types.
• Functions such as as.character and as.numeric change data from one type to another.
• Indexing and logical operators

Summary of commands
    print(a)
    ls()                 # see the variables stored in my workspace
    ?log
    args(log)            # quick look at the arguments of a function
    class(a)
    library(dslabs)      # loading the library
    data(murders)        # loading the data
    str(murders)         # reveals more about the structure of an object
    head(murders)        # shows the first six lines
    names(murders)
    length(pop)
    matrix(1:12, nrow, ncol)   # creating a matrix
    as.data.frame(mat)   # converting a matrix into a data frame
    sort(x)
    order(x)
    max(x)               # the largest value
    which.max(x)         # index of the entry with the largest value
Summary of commands (continued)
    seq()                     # creating a sequence
    levels(murders$region)    # the categories of a factor
    murders$population        # access a variable of the data frame
    list()                    # creating a list
    min(x)
    which.min(x)
    rank(x)                   # vector with the rank of each entry
    match()
    which()
    %in%
    table(x)                  # takes a vector and returns the frequency of each element
    install.packages("Package Name")   # one or multiple packages

Base R Cheat Sheet

Getting Help
Accessing the help files
    ?mean                          Get help on a particular function.
    help.search('weighted mean')   Search the help files for a word or phrase.
    help(package = 'dplyr')        Find help for a package.
More about an object
    str(iris)     Get a summary of an object's structure.
    class(iris)   Find the class an object belongs to.

Using Packages
    install.packages('dplyr')   Download and install a package from CRAN.
    library(dplyr)              Load the package into the session, making all its functions available to use.
    dplyr::select               Use a particular function from a package.
    data(iris)                  Load a built-in dataset into the environment.

Working Directory
    setwd('C://file/path')   Change the current working directory.

Programming
For Loop
    for (variable in sequence){
      Do something
    }
Example
    for (i in 1:4){
      j <- i + 10
      print(j)
    }
While Loop
    while (condition){
      Do something
    }
Example
    while (i < 5){
      print(i)
      i <- i + 1
    }

Vectors
Creating Vectors
    c(2, 4, 6)            2 4 6          Join elements into a vector
    2:6                   2 3 4 5 6      An integer sequence
    seq(2, 3, by = 0.5)   2.0 2.5 3.0    A complex sequence
    rep(1:2, times = 3)   1 2 1 2 1 2    Repeat a vector
    rep(1:2, each = 3)    1 1 1 2 2 2    Repeat elements of a vector
Vector Functions
    sort(x)     Return x sorted.
    rev(x)      Return x reversed.
    table(x)    See counts of values.
    unique(x)   See unique values.
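The sequence, loop and vector-function entries above can be run as one script. This is a minimal sketch with made-up data; the names shifted and visited are introduced here for illustration. Note that the while example on the cheat sheet assumes i already exists, so the sketch initialises it first.

```r
# Integer sequence and a sequence with a custom step
2:6
seq(2, 3, by = 0.5)

# Repeating a whole vector versus repeating each element
rep(1:2, times = 3)
rep(1:2, each = 3)

# for loop: collect i + 10 for i = 1, ..., 4
shifted <- numeric(0)
for (i in 1:4) {
  shifted <- c(shifted, i + 10)
}
shifted

# while loop: the counter must be initialised before the loop starts
i <- 1
visited <- numeric(0)
while (i < 5) {
  visited <- c(visited, i)
  i <- i + 1
}
visited

# A few vector functions on made-up data
x <- c(3, 1, 3, 2)
sort(x)
rev(x)
unique(x)
table(x)
```

Growing a vector with c() inside a loop is fine for a handful of iterations; for long loops, preallocating the result is the more common choice.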
Selecting Vector Elements
By Position
    x[4]         The fourth element.
    x[-4]        All but the fourth.
    x[2:4]       Elements two to four.
    x[-(2:4)]    All elements except two to four.
    x[c(1, 5)]   Elements one and five.
By Value
    x[x == 10]             Elements which are equal to 10.
    x[x < 0]               All elements less than zero.
    x[x %in% c(1, 2, 5)]   Elements in the set 1, 2, 5.
Named Vectors
    x['apple']   Element with name 'apple'.

Functions
    function_name <- function(var){
      Do something
      return(new_variable)
    }
Example
    square <- function(x){
      squared <- x*x
      return(squared)
    }

If Statements
    if (condition){
      Do something
    } else {
      Do something different
    }
Example
    if (i > 3){
      print('Yes')
    } else {
      print('No')
    }

Reading and Writing Data
Also see the readr package.
    Input                          Output                          Description
    df <- read.table('file.txt')   write.table(df, 'file.txt')     Read and write a delimited text file.
    df <- read.csv('file.csv')     write.csv(df, 'file.csv')       Read and write a comma separated value file. This is a special case of read.table/write.table.
    load('file.RData')             save(df, file = 'file.Rdata')   Read and write an R data file, a file type special for R.

Working Directory
    getwd()   Find the current working directory (where inputs are found and outputs are sent).
Use projects in RStudio to set the working directory to the folder you are working in.

Conditions
    a == b       Are equal
    a != b       Not equal
    a > b        Greater than
    a < b        Less than
    a >= b       Greater than or equal to
    a <= b       Less than or equal to
    is.na(a)     Is missing
    is.null(a)   Is null

RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • mhairihmcneill@gmail.com • rstudio.com • Learn more at web page or vignette • package version • Updated: 3/15

Types
Converting between common data types in R. Can always go from a higher value in the table to a lower value.
    as.logical     TRUE, FALSE, TRUE                 Boolean values (TRUE or FALSE).
    as.numeric     1, 0, 1                           Integers or floating point numbers.
    as.character   '1', '0', '1'                     Character strings. Generally preferred to factors.
    as.factor      '1', '0', '1', levels: '1', '0'   Character strings with preset levels. Needed for some statistical models.

Maths Functions
    log(x)        Natural log.
    exp(x)        Exponential.
    max(x)        Largest element.
    min(x)        Smallest element.
    round(x, n)   Round to n decimal places.
    signif(x, n)  Round to n significant figures.
    cor(x, y)     Correlation.
    sum(x)        Sum.
    mean(x)       Mean.
    median(x)     Median.
    quantile(x)   Percentage quantiles.
    rank(x)       Rank of elements.
    var(x)        The variance.
    sd(x)         The standard deviation.

Variable Assignment
    > a <- 'apple'
    > a
    [1] "apple"

The Environment
    ls()              List all variables in the environment.
    rm(x)             Remove x from the environment.
    rm(list = ls())   Remove all variables from the environment.
You can use the environment panel in RStudio to browse variables in your environment.

Matrices
    m <- matrix(x, nrow = 3, ncol = 3)   Create a matrix from x.
    m[2, ]        Select a row
    m[ , 1]       Select a column
    m[2, 3]       Select an element
    t(m)          Transpose
    m %*% n       Matrix multiplication
    solve(m, n)   Find x in: m %*% x = n

Lists
    l <- list(x = 1:5, y = c('a', 'b'))   A list is a collection of elements which can be of different types.
    l[[2]]   Second element of l.
    l[1]     New list with only the first element.
    l$x      Element named x.
    l['y']   New list with only element named y.

Data Frames
Also see the dplyr package.
    df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))   A special case of a list where all elements are the same length.
        x y
        1 a
        2 b
        3 c
List subsetting
    df$x   df[[2]]
Matrix subsetting
    df[ , 2]   df[2, ]   df[2, 2]
Understanding a data frame
    View(df)   See the full data frame.
    head(df)   See the first 6 rows.
    nrow(df)   Number of rows.
    ncol(df)   Number of columns.
    dim(df)    Number of rows and columns.
    cbind      Bind columns.
    rbind      Bind rows.

Strings
Also see the stringr package.
    paste(x, y, sep = ' ')      Join multiple vectors together.
    paste(x, collapse = ' ')    Join elements of a vector together.
    grep(pattern, x)            Find regular expression matches in x.
    gsub(pattern, replace, x)   Replace matches in x with a string.
    toupper(x)                  Convert to uppercase.
    tolower(x)                  Convert to lowercase.
    nchar(x)                    Number of characters in a string.

Factors
    factor(x)            Turn a vector into a factor. Can set the levels of the factor and the order.
    cut(x, breaks = 4)   Turn a numeric vector into a factor by 'cutting' into sections.

Statistics
    lm(y ~ x, data=df)    Linear model.
    glm(y ~ x, data=df)   Generalised linear model.
    summary               Get more detailed information out of a model.
    t.test(x, y)          Perform a t-test for difference between means.
    pairwise.t.test       Perform pairwise t-tests between group levels, with correction for multiple testing.
    prop.test             Test for a difference between proportions.
    aov                   Analysis of variance.

Distributions
                Random Variates   Density Function   Cumulative Distribution   Quantile
    Normal      rnorm             dnorm              pnorm                     qnorm
    Poisson     rpois             dpois              ppois                     qpois
    Binomial    rbinom            dbinom             pbinom                    qbinom
    Uniform     runif             dunif              punif                     qunif

Plotting
Also see the ggplot2 package.
    plot(x)      Values of x in order.
    plot(x, y)   Values of x against y.
    hist(x)      Histogram of x.

Dates
See the lubridate package.

Other functions to apply on numerical vectors
Curious about learning more R basics?
• Read the first chapter and appendix of our script!
• Actively participate in the exercise sessions
• Ask questions on Slack
• Practice with DataCamp
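For practice, the matrix notes from the lecture and the matrix entries of the cheat sheet can be combined in one short script. This is only a sketch: mat follows the 4 x 3 example from the notes, while m and n are made-up values chosen so that the linear system has a simple solution.

```r
# Fill a 4 x 3 matrix column by column with the numbers 1 to 12
mat <- matrix(1:12, nrow = 4, ncol = 3)

mat[2, 3]      # single entry: row 2, column 3
mat[2, ]       # entire second row
mat[, 3]       # entire third column
mat[, 2:3]     # more than one column
mat[1:2, 2:3]  # subset of rows and columns at the same time

# Matrix algebra on a small invertible matrix (values are made up)
m <- matrix(c(2, 1, 1, 3), nrow = 2)
n <- c(5, 10)
t(m)              # transpose
m %*% m           # matrix multiplication
x <- solve(m, n)  # solves m %*% x = n for x

# Converting a matrix into a data frame
df <- as.data.frame(mat)
class(df)
```

Because matrix() fills column by column, the entry in row 2, column 3 is 10, which matches the value in the lecture notes.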