Table of Contents

1.1 Introduction
1.2 Data Science with R
    How to work with R
1.3 R Session and Functions
1.4 Business Analytics, Data and Information
1.5 Basic Math in R
  1.5.1 Variables in R
1.6 Advanced Data Structures in R
1.7 Understanding Business Analytics with R
1.8 Comparison of R with other Software for Analytics
1.9 Installation of R
1.10 Using R Command Line Interface
1.11 Exploring and Learning RStudio
  1.11.1 Data Management in RStudio
  1.11.2 Importing Data in RStudio
  1.11.3 Exporting, Viewing and Removing Data
1.12 Using Help Feature in R
1.13 Chapter Summary
1.14 Online Resources
1.15 Exercise
2.1 Introduction to R Programming
2.2 Variables in R
  2.2.1 What is Variable?
  2.2.2 Assigning values to Variables
  2.2.3 Good Practices
  2.2.4 Creating new Variables
  2.2.5 Can we Rename Variables
2.3 Scalars in R
2.4 Data Types in R
  2.4.1 Vectors
  2.4.2 Matrices
  2.4.3 What is List in R
2.5 Data Frames
2.6 Arrays
2.7 Classes and Objects
2.8 R Programming Structures
  2.8.1 R Control Statements
2.9 R Arithmetic Operators
  2.9.1 Values
2.10 Functions in R
  2.10.1 Does R have Pointers?
  2.10.2 Recursion
2.11 Use of c and cbind in R
  2.11.1 Rbind/ rbind Function
  2.11.2 R attach() and detach() Functions
2.12 Chapter Summary
2.13 Online Resources
2.14 Exercises
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
Bibliography

CHAPTER 01
Introduction to Data Analytics

LEARNING OBJECTIVES

In this chapter, you will learn:
  How to install and configure R and write your first program in R
  A brief introduction to data analytics
  About the IDE named RStudio
  How R functions are defined and how they work

1.1 Introduction

Nowadays the amount of data generated by a wide range of technologies, such as social networking sites like Instagram, Facebook and Twitter or e-commerce sites, is enormous, and it is becoming difficult to store such volumes of data using traditional data storage facilities. Data is created constantly, and at an ever-increasing rate.
Mobile phones, social media and the imaging technologies used in medical diagnosis all produce new data, which must be stored somewhere safe and be retrievable whenever it is needed. Devices and sensors automatically generate diagnostic information that must be stored and processed in real time. Merely keeping up with this huge inflow of data is difficult, but substantially more challenging is analyzing vast amounts of it, especially when it does not conform to traditional notions of data structure, in order to identify meaningful patterns and extract useful information.

Statistics with R Programming

Several industries have led the way in developing their ability to gather and exploit data:

Credit card companies monitor every purchase their customers make and can identify fraudulent purchases with a high degree of accuracy, using rules derived by processing billions of transactions.

Mobile phone companies analyze subscribers' calling patterns to determine, for example, whether a caller's frequent contacts are on a rival network. If that rival network is offering an attractive promotion that might cause the subscriber to defect, the mobile phone company can proactively offer the subscriber an incentive to remain in her contract.

The valuations of these companies are heavily derived from the data they gather and host, which contains more and more intrinsic value as the data grows.

Sources of Big Data

Big Data is generated by numerous platforms and technologies these days. Some of them are:

[Figure: Sources of Big Data (Stock Exchange, Social Media Data, Video Sharing Portals, Transport Data, Banking Data)]

Stock Exchange: Share-market data, which covers the prices and status details of the shares of thousands of companies, is very large.
Social Media Data: The data of social networking sites contains information about all the account holders, their posts, chat history, advertisements and so on. Popular websites and applications such as Facebook and Instagram have billions of users, who together produce huge amounts of data.

Video Sharing Portals: Video sharing portals like YouTube and Vimeo contain millions of videos, each of which requires a lot of storage.

Transport Data: Transport data contains information about the model, capacity, distance and availability of different vehicles and their status.

Banking Data: The giants of the banking domain, like State Bank of India or ICICI, hold large amounts of data about the transactions of their account holders.

1.2 Data Science with R

Data Science is the study of data: not just handling it, but studying the science of data in every aspect. It is a discipline that lets you turn complex raw data into understanding, insight and knowledge. The motivation behind R programming for Data Science is to help users learn the most important tools in R, which let them perform the transformations needed to make use of their data.

Defining Data Science

Data Science is a field related to Big Data that seeks to provide meaningful information from huge amounts of complex data. It is a system for retrieving information from data in various forms, either structured or unstructured. Data Science combines different fields of work and study in statistics and computation in order to understand data for the purpose of decision making.

How to work with R

To work with data in R, you first have to import it, for example with the "import" button of the interface. Importing means taking data stored in a file, a database or a web application programming interface (API) and loading it into a data frame in R.
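As a minimal sketch of the import step from the console rather than from a button, the snippet below reads a small CSV file into a data frame with base R's read.csv(). The file contents, the column names and the object name trips are hypothetical and exist only for illustration; the file is written to a temporary location first so that the example is self-contained.

```r
# Hypothetical sample data, written to a temporary file so the sketch
# does not depend on any file existing on your machine.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("city,distance_km,time_h",
             "Pune,120,2",
             "Delhi,300,5"), csv_path)

# Import: load the contents of the file into a data frame.
trips <- read.csv(csv_path, stringsAsFactors = FALSE)

# Inspect the structure of the imported data frame.
str(trips)
```

With real data you would pass the path of your own file to read.csv() instead of the temporary one.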
Without imported data, no further step of data science can be carried out!

The next thing to do is to tidy your data. Yes, you heard right: tidy your data. Data is often in a complex form, so it has to be made tidy, which means storing it in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable and each row is an observation. This step is important because the consistent structure lets you focus on your questions about the data, rather than on struggling to get the data into the right form for different functions.

Once the data is tidy, the next step is to transform it. Transformation includes narrowing in on observations of interest (like all people in one town, or all data from the last year), creating new variables that are functions of existing variables (like computing speed from distance and time), and calculating a set of summary statistics (like counts or means). Together, tidying and transforming are called wrangling, because getting your data into a form that is natural to work with often feels like a fight!

Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation may also hint that you are asking the wrong question, or that you need to collect different data. Visualisations can surprise you, but they do not scale particularly well because they require a human to interpret them.

Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them.
Models are fundamentally mathematical or computational tools, so they generally scale well.

The final step of data science is communication, an absolutely critical part of any data analysis project. It does not matter how well your models and visualisations have led you to understand the data unless you can also communicate your results to others.

How to run an R Program

This section guides you through writing and running code in R. Before you run R code you need a few things: R (the base language), RStudio (an IDE for writing and executing R projects), a collection of R packages called the tidyverse, and a few other packages. Packages are the fundamental units of reusable R code. They include reusable functions, the documentation that describes how to use them, and sample data.

An example of R code:

5+2
#> [1] 7

If you run the code in a terminal or console, it will look like this:

> 5+2
[1] 7

1.3 R Session and Functions

Like some other languages in computer science, R is a functional language. Most of the computation in R is handled using functions. The R language environment is designed to facilitate the development of new scientific computation tools.

Working with R Session

R can be used in two ways. One can either type commands on the console inside an "R session", or save the commands as a "script" file and execute the whole file inside R. First let us talk about the R session. To begin an R session, type 'R' on the command line (cmd in Windows, or a terminal in Linux).
Below is an example. At a Linux terminal, where the shell prompt is '$', type

$ R

This generates the output given below before showing the '>' prompt of R:

R version 3.2.1 (2015-07-10) -- "Stay With Me"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

>

Once we are inside the R session, we can execute R commands directly by typing them line by line in the console window. Pressing the Enter key ends the command and brings back the '>' prompt. In the example session below, we declare two variables, 'a' and 'b', assign them the values five and six respectively, and assign their sum to another variable called 'c':

> a = 5
> b = 6
> c = a + b
> c

The value of c is shown as:

[1] 11

In an R session, typing a variable name prints its value on the console.

Saving in the R Session

Note that if the current session is not saved, all the variables and objects created in it are lost after exiting the R prompt. When we work with R, the objects we create and load are kept in a portion of memory called the workspace.
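To make the idea of the workspace concrete, here is a small sketch using the base R functions ls(), rm() and exists(). The helper names objects_before and b_still_exists are introduced only for this illustration.

```r
# Create two objects; they now live in the workspace.
a <- 5
b <- 6

# ls() returns the names of the objects currently in the workspace.
objects_before <- ls()
print(objects_before)

# rm() removes an object from the workspace.
rm(b)

# exists() reports whether a name is still defined in the workspace.
b_still_exists <- exists("b", inherits = FALSE)
print(b_still_exists)
```

The list printed by ls() depends on what else has been created in the session, so your output may contain more names than 'a' and 'b'.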
If we say 'no' to saving the workspace, all these objects are wiped from the workspace memory. If we say 'yes', they are saved into a file called ".RData", which is written to the current working directory. In Linux, this working directory is generally the directory from which R was started with the command 'R'. In Windows, it is either "My Documents" or the user's home directory. When we start R in the same directory the next time, the workspace and all the created objects are restored automatically from this ".RData" file.

Exit the R Session

One can exit the R session by typing the quit() command at the R prompt and answering 'n' (no) to the question about saving the workspace image. This simply means that we do not want to keep the objects created in the current session:

> quit()
Save workspace image? [y/n/c]: n

Functions in R

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:

1. You can give a function a memorable, descriptive name that makes your code easier to understand.
2. As requirements change, you only need to update the code in one place, instead of many.
3. You reduce the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

Writing good functions is a lifetime journey. Even after using R for many years I still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.
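The three advantages listed above can be sketched with a small example. The vectors heights and weights and the function name rescale01 are hypothetical and chosen only for illustration; the function rescales a numeric vector to the range 0 to 1.

```r
# Copy-and-paste style: the same rescaling logic is repeated for each vector,
# so every change (or typo) has to be tracked in several places.
heights <- c(150, 160, 170)
weights <- c(50, 60, 80)
heights_scaled <- (heights - min(heights)) / (max(heights) - min(heights))
weights_scaled <- (weights - min(weights)) / (max(weights) - min(weights))

# Function style: the logic lives in one place under a descriptive name.
rescale01 <- function(x) {
  rng <- range(x)                      # c(minimum, maximum) of x
  (x - rng[1]) / (rng[2] - rng[1])     # last expression is the return value
}

rescale01(heights)
rescale01(weights)
```

If the rescaling rule ever changes, only the body of rescale01 needs to be edited, and both calls pick up the change.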
As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good code style is like correct punctuation. You can manage without it, but it certainly makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.

Function Definition in R

An R function is created using the keyword function. The basic syntax of an R function definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
   Function body
}

Components of Function in R

An R function has the following components:

1. Name of the Function: This is the actual name of the function. It is stored in the R environment as an object with this name.
2. Arguments: An argument or parameter is a placeholder for a value passed to the function. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may have no arguments. Arguments can also have default values.
3. Function Body: The function body contains a collection of statements that defines what the function does.
4. Return Value: The return value of a function is the last expression in the function body to be evaluated.

R has many built-in functions which can be called directly in a program without defining them first. We can also create and use our own functions, referred to as user-defined functions.

1.4 Business Analytics, Data and Information

Business analytics begins with a data set (a simple collection of data or a data file) or, commonly, with a database (a collection of data files that contain information on people, locations, and so on). As databases grow, they need to be stored somewhere.
Technologies such as computer clouds (hardware and software used for remote data storage, retrieval and computational functions) and data warehousing (a collection of databases used for reporting and data analysis) store data. Database storage areas have become so large that a new term was devised to describe them. Big data describes collections of data sets that are so large and complex that software systems are hardly able to process them.

Definition of Business Analytics: Business intelligence (BI) can be defined as a set of processes and technologies that convert data into meaningful and useful information for business purposes.

Business Analytics Examples

Business analytics techniques break down into two main areas. The first is basic business intelligence. This involves examining historical data to get a sense of how a department, team or staff member performed over a particular time. This is a mature practice that most enterprises are fairly accomplished at using.

The second area of business analytics involves deeper statistical analysis. This may mean doing predictive analytics by applying statistical algorithms to historical data to make a prediction about the future performance of a product, service or website design change. Or it could mean using other advanced analytics techniques, like cluster analysis, to group customers based on similarities across several data points. This can be helpful in targeted marketing campaigns, for example.

The different types of analytics include:

1. Descriptive analytics, which tracks key performance indicators to understand the present state of a business.
2. Predictive analytics, which analyzes trend data to assess the likelihood of future outcomes.
3. Prescriptive analytics, which uses past performance to generate recommendations about how to handle similar situations in the future.
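As a rough sketch of the difference between the first two types, the snippet below computes a simple descriptive figure and a simple trend-based prediction with base R. The sales numbers are made up purely for illustration, and the straight-line model fitted with lm() is just one assumed choice of predictive technique among many.

```r
# Hypothetical monthly sales figures (invented for this illustration only).
sales <- data.frame(month = 1:6,
                    units = c(100, 110, 120, 130, 140, 150))

# Descriptive analytics: summarise the present state with a simple indicator.
average_units <- mean(sales$units)

# Predictive analytics: fit a linear trend and project the following month.
trend <- lm(units ~ month, data = sales)
forecast_month7 <- predict(trend, newdata = data.frame(month = 7))

average_units
forecast_month7
```

Prescriptive analytics would go one step further and turn such a forecast into a recommendation, which is beyond the scope of this small sketch.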
Business Analytics vs. Data Science

The more advanced areas of business analytics can start to resemble data science, but there is a distinction. Even when advanced statistical algorithms are applied to data sets, it does not necessarily mean that data science is involved. There are a host of business analytics tools that can perform these kinds of functions automatically, requiring few of the special skills involved in data science. True data science involves more custom coding and more open-ended questions. Data scientists generally do not set out to solve a specific question, as most business analysts do. Rather, they explore data using advanced statistical methods and allow the features in the data to guide their analysis.

1.5 Basic Math in R

As in most programming languages, you can perform basic math operations in R. The operators are listed below.

Arithmetic Operations:

Operator   Description
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponent
%%         Modulus
%/%        Integer Division

Below is a sample of arithmetic operations:

> x <- 5
> y <- 16
> x+y
[1] 21
> x-y
[1] -11
> x*y
[1] 80
> y/x
[1] 3.2
> y%/%x
[1] 3
> y%%x
[1] 1
> y^x
[1] 1048576

Relational Operators in R

Relational operators are used to compare values. Here is a list of the relational operators available in R.
Operator   Description
<          Less Than
>          Greater Than
<=         Less Than or Equal To
>=         Greater Than or Equal To
==         Equal To
!=         Not Equal To

Below is a sample of relational operations:

> x <- 5
> y <- 16
> x<y
[1] TRUE
> x>y
[1] FALSE
> x<=5
[1] TRUE
> y>=20
[1] FALSE
> y == 16
[1] TRUE
> x != 5
[1] FALSE

Logical Operators in R

All logical operations in R are performed using the operators listed below.

Operator   Description
!          Logical NOT
||         Logical OR
|          Element-wise Logical OR
&&         Logical AND
&          Element-wise Logical AND

The operators '&' and '|' perform element-wise operations, producing a result with the length of the longer operand. The operators '&&' and '||' work on single values and return a logical vector of length one. In the R version shown in this chapter they examine only the first element of longer operands; from R 4.3.0 onwards, passing an operand longer than one is an error, so the sample below selects the first elements explicitly. Zero is treated as FALSE and non-zero numbers are treated as TRUE. A sample:

> x <- c(TRUE,FALSE,0,6)
> y <- c(FALSE,TRUE,FALSE,TRUE)
> !x
[1] FALSE  TRUE  TRUE FALSE
> x&y
[1] FALSE FALSE FALSE  TRUE
> x[1] && y[1]
[1] FALSE
> x|y
[1]  TRUE  TRUE FALSE  TRUE
> x[1] || y[1]
[1] TRUE

Assignment Operators in R

Moving on, we have the assignment operators, which are used for assigning values to variables.

Operator      Description
<-, <<-, =    Leftwards Assignment
->, ->>       Rightwards Assignment

The operators <- and = can be used, almost interchangeably, to assign to a variable in the same environment. The <<- operator is used for assigning to variables in the parent environments (more like global assignments). The rightward assignments, although available, are rarely used.

> x <- 5
> x
[1] 5
> x = 9
> x
[1] 9
> 10 -> x
> x
[1] 10

1.5.1 Variables in R

Variables are like placeholders for values in a programming language. We will now discuss variables and constants in R.
You will also learn the best practices for using a variable in a program.

A Variable

Variables are used to store data whose value can be changed as needed. The unique name given to a variable (or to a function or object) is its identifier.

Rules for writing Identifiers in R

1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. An identifier must start with a letter or a period. If it starts with a period, the period cannot be followed by a digit.
3. Reserved words in R cannot be used as identifiers.

Valid Identifiers in R

total, Sum, .fine.with.dot, this_is_variable, Number4

Invalid Identifiers in R

tot@l, 5um, _not-a-variable, TRUE, .0ne

Good Practices

Earlier versions of R used the underscore (_) as an assignment operator, so the period (.) was used extensively in variable names made of multiple words. Current versions of R accept the underscore in identifiers, but it is good practice to use the period as a word separator. For example, a.variable.name is preferred over a_variable_name. Alternatively, we could use camel case, as in aVariableName.

Constants in R

Entities whose values cannot be changed once defined are known as constants. There are two basic types of constants in R: numeric constants and character constants.

Numeric Constants

All numbers fall under this category. They can be of type integer, double or complex, which can be checked with the typeof() function. Numeric constants followed by L are regarded as integers and those followed by i are regarded as complex.

> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"

Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.

> 0xff
[1] 255
> 0XF + 1
[1] 16

Character Constants

Character constants can be written using either single quotes (') or double quotes (") as delimiters.
> 'example'
[1] "example"
> typeof("5")
[1] "character"

Predefined Constants in R

Some of the built-in constants defined in R are shown below along with their values.

> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"

1.6 Advanced Data Structures in R

A data structure is a specific way of organizing and storing data. R supports five basic types of data structure: vector, matrix, list, data frame and factor. We will discuss these data structures and how to write them in R.

Vector

A vector holds elements of the same type, e.g. integer, double, logical, complex, etc. Vectors are created with the c() function and are widely used. For example,

> x <- 1:7; x
[1] 1 2 3 4 5 6 7
> y <- 2:-2; y
[1]  2  1  0 -1 -2

a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector

Matrices

A matrix is a two-dimensional (2D) data structure that can be created using the matrix() function. The numbers of rows and columns can be set with the nrow and ncol arguments.
However, providing both is not required, as the other dimension is inferred from the length of the data. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The general format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
   dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default). dimnames provides optional labels for the rows and columns.

# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)

# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
   dimnames=list(rnames, cnames))

Lists

A list is an ordered collection of objects (components) that may be of different types. It is similar to a vector, but while a vector holds data of a single type, a list can hold mixed data. A list lets you gather a variety of (possibly unrelated) objects under one name. A list is created using list().

# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

# example of combining two existing lists, list1 and list2, into one list
v <- c(list1,list2)

Data Frame

A data frame is a special case of a list in which every component has the same length. A data frame is created using the data.frame() function. For example:

> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
> str(x) # structure of x
'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1

From R 4.0.0 onwards, character columns are no longer converted to factors by default, so Name is shown as chr unless stringsAsFactors = TRUE is passed to data.frame().

Factors

Tell R that a variable is nominal by declaring it as a factor.
The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.

# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)

1.7 Understanding Business Analytics with R

Today, it is imperative for every modern business to understand the large amounts of data it holds about its customers and itself. Relying on spreadsheets for this is futile, and although SAS offers a solution, it is not the simplest one. R is an open source programming language that has been widely used by scientists across the world. It is a language that can also help businesses analyze large amounts of data easily and effectively.

R makes it easy for a business to go through its entire data. It does this by scaling the data so that different, parallel processors can work on it at the same time. With a regular R package, most computers do not have enough memory to handle large amounts of data. However, R offers ScaleR, which breaks the data into smaller chunks so that they can be processed on several servers at the same time. In other words, ScaleR makes it easy to divide a large database across different nodes. This lets users carry out statistical analyses in a very sophisticated manner.
Moreover, the language also makes it possible for programmers to run periodic checks on the data while it is being processed. This benefits businesses because they can take large amounts of data and fine-tune it to perform more sophisticated analyses.

To demonstrate the power of R, take a simple example. A business with thirty million rows of data on sixty different variables can now have that data analyzed in just ten minutes with the help of an R package. SAS cannot compete with this, which is why so many companies and organizations are becoming fans of R.

In addition, R gives businesses a clear view of the kind of data they are dealing with. The nice thing about using R is that businesses do not need to develop customized tools, nor do they have to write a great deal of code. R lets a business clean and analyze its data easily and quickly.

R also lets a business explore new data with the help of customized visualizations. R is very strong when it comes to visualization and graphics. It lets businesses create attractive graphics with little effort, something SAS does not do nearly as well. By comparison, SAS's graphics are quite unattractive.

1.8 Comparison of R with other Software for Analytics

There has long been a lively debate about how R compares with SAS or Python for data analytics. SAS and R, in particular, are both contenders for data science work. Let us look at some factors on which they can be compared.
SAS

SAS (Statistical Analysis System) is a software suite that can mine, alter, manage and retrieve data from a variety of sources and perform statistical analysis on it. SAS has become the undisputed market leader in the commercial analytics space. The software offers a huge range of statistical functions, has a good GUI (Enterprise Guide and Miner) that helps people learn quickly, and comes with excellent technical support. However, it ends up being the most expensive option and is not always updated with the latest statistical functions.

R

R is the open source counterpart of SAS and has traditionally been used by researchers and academics. Because of its open source nature, the latest techniques are released quickly. There is a lot of documentation available on the web, and it is a very cost-effective option. Fundamentally, R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing.

1. Availability/Cost:
SAS is commercial software. It is expensive and still beyond the reach of most professionals (in an individual capacity). However, it holds the largest market share in private organizations. So, unless you are in an organization that has invested in SAS, it may be difficult to get access to it. R, on the other hand, is free and can be downloaded by anybody.

2. Ease of learning:
SAS is easy to learn and provides a straightforward option (PROC SQL) for people who already know SQL. Even otherwise, it has a good, stable GUI. In terms of resources, there are tutorials available on the websites of various universities, and SAS has comprehensive documentation. There are certifications from SAS training institutes, but they come at a cost.

3. Advancements in tools:
Both ecosystems have all the basic and most commonly needed functions available. This factor only matters if you work on the latest technologies and algorithms. Because of its open nature, R gets the latest features quickly (more so than Python). SAS, on the other hand, updates its capabilities in new version roll-outs. Since R has been widely used in academia, new techniques are developed for it quickly. SAS releases updates in a controlled environment, so they are well tested. R, on the other hand, has open contributions, so there is a chance of errors in the latest developments.

4. What is the contention?
It is really not that simple. Linux can do everything Windows does and more, yet Windows still rules. Among the main reasons for the continued dominance of Windows are an easier user experience and momentum. Despite all the advantages Linux offers (better security, fewer viruses, and a comparable user experience, particularly in the Ubuntu versions), the common man prefers Windows. That is not to say that Linux lacks an enthusiastic support community and a loyal following. The debate between SAS and R is similar.

5. Statistical capability:
The various SAS procedures and SAS/STAT are powerful and cover practically the entire range of statistical techniques and analysis. However, since R is open source and people can submit their own packages and libraries, the latest cutting-edge techniques are always released in R first. To date, R has about 15,000 packages in the CRAN (Comprehensive R Archive Network, the site that hosts R packages) repository. Some of the more recent techniques, such as glmnet, AdaBoost and random forests (RF), are available in R but not in SAS. Many experimental packages are also available in R.
In fact, in most Kaggle competitions (a topic that deserves a post of its own), the winners (who are among the world's best data miners) have almost always used R to build their models.

6. Customer Service Support and Community:
R has the biggest online community but no customer service support. So if you have trouble, you are on your own, although you will get a lot of help from the community. SAS, on the other hand, has dedicated customer service in addition to its community. So, if you have problems with installation or any other technical challenge, you can reach out to them.

Conclusion

Clearly, there is no outright winner in this race. It would be premature to place bets on what will prevail, given the dynamic nature of the industry. Depending on your circumstances (career stage, finances and so on), you can apply your own weights and decide what is suitable for you. Here are a few specific scenarios:

• If you are a fresher entering the analytics industry, we recommend learning SAS as your first language. It holds the largest share of the job market and is easy to learn.
• If you have already spent time in industry, you should try to broaden your expertise by learning a new tool.
• Experts and professionals in industry should know at least 2 of these tools. That adds a lot of flexibility for the future and opens up new opportunities.
• If you are in a start-up or freelancing, R is more useful.

1.9 Installation of R

R, from the R Project (https://www.r-project.org), is a commonly used free statistics software. R lets you carry out statistical analyses interactively, and also allows simple programming. To use R, you first need to install the R program on your computer.
How to check whether R is installed on a Windows PC

Before you install R on your computer, the first thing to do is to check whether R is already installed (for example, by a previous user). These instructions focus on installing R on a Windows PC. However, we will also briefly mention how to install R on a Macintosh or Linux computer (see below). If you are using a Windows PC, there are two ways to check whether R is already installed:

1. Check whether there is an "R" icon on the desktop of the computer you are using. If so, double-click the "R" icon to start R. If you cannot find an "R" icon, try the next step instead.
2. Click the "Start" menu at the bottom left of your Windows desktop, and then move your mouse over "All Programs" in the menu that pops up. See if "R" appears in the list of programs. If it does, R is already installed on your computer, and you can start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, for instance, R 2.10.0) from the list.

If either (1) or (2) above succeeds in starting R, R is already installed on the computer you are using. (If neither succeeds, R is not installed yet.) If an old version of R is installed on the Windows PC, it is worth installing the latest version, to make sure that you have all the latest R functions available to you.

To check the latest version of R

To find out what the latest version of R is, look at the CRAN (Comprehensive R Archive Network) website, http://cran.r-project.org/. Beside "The latest release" (about half way down the page), it will say something like "R-X.X.X.tar.gz" (e.g. "R-2.12.1.tar.gz"). This means that the latest release of R is X.X.X (for example, 2.12.1).
New releases of R are made regularly (several times a year), as R is actively being improved all the time. It is worth installing new versions of R regularly, to make sure that you have a recent version (and to ensure compatibility with the latest versions of the R packages that you have downloaded).

Installation steps for R on Windows Operating System

To install R, follow the steps given below:

1. Open your browser and go to the URL https://cran.r-project.org
2. Under "Download and Install R", click the Windows link.
3. Under "Subdirectories", click the "base" link.
4. On the next page, you should see a link saying something like "Download R 2.10.1 for Windows" (or R X.X.X, where X.X.X gives the version of R, e.g. R 2.11.1). Click this link.
5. You may be asked whether you want to save or run a file "R-2.10.1-win32.exe". Choose "Save" and save the file on the Desktop. Then double-click the icon for the file to run it.
6. You will be asked what language to install it in. Choose English.
7. The R Setup Wizard will appear in a window. Click "Next" at the bottom of the R Setup wizard window.
8. The next page says "Information" at the top. Click "Next" again.
9. The next page says "Information" at the top. Click "Next" again.
10. The next page says "Select Destination Location" at the top. By default, it will suggest installing R in "C:\Program Files" on your computer.
11. Click "Next" at the bottom of the R Setup wizard window.
12. The next page says "Select components" at the top. Click "Next" again.
13. The next page says "Startup options" at the top. Click "Next" again.
14. The next page says "Select start menu folder" at the top. Click "Next" again.
15. The next page says "Select additional tasks" at the top. Click "Next" again.
16. R should now be installed. This will take about a minute. When R has finished, you will see "Completing the R for Windows Setup Wizard" appear. Click "Finish".
17. To start R, follow either step 18 or step 19:
18. Check whether there is an "R" icon on the desktop of the computer you are using. If so, double-click the "R" icon to start R. If you cannot find an "R" icon, try step 19.
19. Click the "Start" button at the bottom left of your computer screen, then choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives the version of R, e.g. R 2.10.0) from the menu of programs.
20. The R console (a rectangle) should pop up:

Getting RStudio

RStudio is an IDE (Integrated Development Environment) that makes R easier to use and is more similar to SPSS or Stata. It includes a code editor, debugging tools and visualization tools. Please use it to get a pleasant R experience.

Head over to https://www.rstudio.com/products/rstudio/download/ and choose the 'FREE' version to download to your PC. RStudio is not completely free: some of its advanced features require a paid subscription. After following the on-screen steps to install RStudio on a Windows PC, start RStudio either from the desktop icon or from the programs menu. Once RStudio has started successfully, you will see the following window:

The Tidyverse

The Tidyverse gives users access to a set of packages that extend R's capabilities and share an underlying design philosophy. A great example is dplyr, a package that really simplifies data manipulation. For instance, it provides, among other functions, group_by and summarise, which perform operations similar to SUMIF or SUMIFS in Microsoft Excel.
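As a rough illustration of the SUMIF comparison, the sketch below totals one column per group using group_by and summarise. The sales data frame and its column names are invented for this example, and the dplyr package is assumed to be installed already.

```r
# Hypothetical data, made up for illustration only
library(dplyr)

sales <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  amount = c(100, 200, 150, 50, 25)
)

# Spreadsheet analogy: one SUMIF per region.
# group_by() splits the rows by region, summarise() sums within each group.
totals <- sales %>%
  group_by(region) %>%
  summarise(total = sum(amount))

print(totals)
```

The result has one row per region, which is usually easier to reuse in later steps than a set of separate SUMIF cells.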
If you want to create plots from R, the Tidyverse provides the ggplot2 package for plot creation. There are excellent tutorials available for learning ggplot2. Another useful feature is that the Tidyverse provides the haven package to import and export data in SPSS, Stata, and SAS formats. The installation instructions differ depending on your operating system: Microsoft Windows, Mac OS X or Ubuntu Linux.

1.10 Using R Command Line Interface

The R Console and other interactive tools like RStudio are great for prototyping code and exploring data, but sooner or later we will want to use our program in a pipeline or run it in a shell script to process thousands of data files. In order to do that, we need to make our programs work like other Unix command-line tools. For example, we may want a program that reads a data set and prints the average inflammation per patient:

$ Rscript readings.R --mean data/inflammation-01.csv
15.45
15.425
16.1
...
16.4
17.05
5.9

But we might also want to look at the minimum of the first four lines:

$ head -4 data/inflammation-01.csv | Rscript readings.R --min

Or the maximum inflammations in several files one after another:

$ Rscript readings.R --max data/inflammation-*.csv

Running R scripts from the command line can be a powerful way to:

• Automate your R scripts
• Integrate R into production
• Call R through other tools or systems

There are basically two Linux commands that are used. The first is the command Rscript, and it is preferred. The older command is R CMD BATCH. You can call these directly from the command line or integrate them into a bash script. You can also call them from any job scheduler. Note that these are R-related tools.
The RStudio IDE does not currently come with tools that enhance or manage the Rscript and R CMD BATCH functions. However, there is a shell built into the IDE, and you could call these commands from there. The alternative to using the Linux command line is to use the source() function inside R. The source function will also run a script, but you have to be inside an R session to use it.

Command-Line Arguments in R

Using the text editor of your choice, save the following line of code in a text file called session-info.R:

sessionInfo()

The function sessionInfo outputs the version of R you are running as well as the type of computer you are using (and also the versions of the packages that have been loaded). This is very useful information to include when asking others for help with your R code. Now we can run the code in the file we created from the Unix shell using Rscript:

$ Rscript session-info.R
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/travis/R-bin/lib/R/lib/libRblas.so
LAPACK: /home/travis/R-bin/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

1.11 Exploring and Learning RStudio

What is RStudio?

As you might know, RStudio is an Integrated Development Environment (IDE) that lets you interact with R more readily. RStudio is similar to the standard RGui, but is considerably more user friendly.
It has more drop-down menus, windows with multiple tabs, and many customization options. The first time you open RStudio, you will see three windows. A fourth window is hidden by default, but can be opened by clicking the File drop-down menu, then New File, and then R Script. Detailed information on using RStudio can be found on RStudio's website.

RStudio Windows/Tabs   Position      Description
Console window         Lower-left    Location where commands are entered and the output is printed
Source tabs            Upper-left    Built-in text editor
Environment tab        Upper-right   Interactive list of loaded R objects
History tab            Upper-right   List of keystrokes entered into the console
Files tab              Lower-right   File explorer to navigate C drive folders
Plots tab              Lower-right   Output location for plots
Packages tab           Lower-right   List of installed packages
Help tab               Lower-right   Output location for help commands and help search window
Viewer tab             Lower-right   Advanced tab for local web content

Fundamental Hints for using R

1. R is command-line driven. It requires you to type or copy-and-paste commands after a command prompt (>) that appears when you open R. After typing a command in the R console and pressing Enter on your keyboard, the command will run. If your command is not complete, R issues a continuation prompt (indicated by a plus sign: +). Alternatively, you can write a script in the script window, select a command, and click the Run button.
2. Like most languages, R is case sensitive, so variables named aVar and avar are two different variables.
3. Commands in R are also called functions. The basic format of a function in R is: function.name(argument, options).
4. The up arrow key (↑) on your keyboard can be used to bring up previous commands that you have typed in the R console.
5. The $ symbol is used to select a particular column within a table (e.g., table$column).
6. Any text that you do not want R to act on (such as comments, notes, or instructions) needs to be preceded by the # symbol (a.k.a. hash-tag, comment, pound, or number symbol). R ignores the remainder of the script line following #. For instance:

plot(x, y) # This text will not affect the plot function because of the comment

1.11.1 Data Management in RStudio

Before you start working in R, you should set your working directory (a folder to hold all of your project files); for example, "C:\workspace\…". This directory is where all your input datasets are stored. It also serves as the default location for plots and other objects exported from R. Once it is set, you can import data into R with just a file name rather than the entire file path. You should typically set your working directory at the start of every R session. To change the working directory in RStudio, select the Files tab > More > Set as Working Directory. The same can be done in the Console by typing:

> setwd("C:/workspace") # beware: R uses forward slashes / instead of backslashes \ in file path names

To check the file path of the current working directory (which should now be "C:/workspace"), type:

getwd()

1.11.2 Importing Data in RStudio

After your working directory is set, you can import data from .csv, .txt and similar files. One basic command for importing data into R is read.csv(). The command is followed by the file name and then some optional instructions for how to read the file. First, create an example file by copying the contents below. Paste the text into Notepad and save the file as sand_example.csv in your C:\workspace folder.
location,landuse,horizon,depth,sand
city,crop,A,14,19
city,crop,B,25,21
city,pasture,A,10,23
city,pasture,B,27,34
city,range,A,15,22
city,range,B,23,23
farm,crop,A,12,31
farm,crop,B,31,35
farm,pasture,A,17,30
farm,pasture,B,26,36
farm,range,A,15,25
farm,range,B,24,29
west,crop,A,13,27
west,crop,B,29,25
west,pasture,A,11,21
west,pasture,B,31,26
west,range,A,14,23
west,range,B,24,24

This dataset can be imported into R either with the Import Dataset button on the Environment tab, or by typing the following command into the R console:

sand <- read.csv("C:/workspace/sand_example.csv")
# if your working directory was already set you could simply use the filename, like so
# sand <- read.csv("sand_example.csv")

1.11.3 Exporting, Viewing and Removing Data

To export data from RStudio, use the write.csv() function. Since we have already set our working directory, R automatically saves the file into the working directory.

write.csv(sand, file = "sand_example2.csv")
# or use the write.table() function to export other text file types

Once a file has been imported, it is imperative that you check that R imported your data correctly. Make sure that numerical data are imported as numerical, that your column headings are preserved, and so on. To view the data, simply click the sand dataset listed in the Environment tab. This opens a separate window that displays a spreadsheet-like view. In addition, you can use the following functions to view your data in R.

Function   Description
print()    prints the entire object (avoid with large tables)
head()     prints the first 6 lines of your data
str()      shows the data structure of an R object
names()    lists the column names (i.e. headers) of the data
ls()       lists all the R objects in your workspace directory

1.12 Using Help Feature in R

R has extensive documentation, numerous mailing lists, and countless books (many of which are free and listed at the end of each chapter of this course). The built-in help files are sometimes cryptic, and the online answers can be brusque, but if you seek help you will find it. To learn more about the function you are using and the options and arguments available, learn to help yourself by taking advantage of the following help functions in RStudio:

• Use the Help tab in the lower-right window to search for commands (such as hist) or topics (such as histogram).
• Type help(read.csv) or ?read.csv in the Console window to bring up a help page. Results will appear in the Help tab in the lower-right window. Certain functions may require quotation marks, such as help("+").

# Help file for a function
help(read.csv) # or
?read.csv

# Help files for a package
help(package = "soiltexture")

1.13 Chapter Summary

As the career paths available in big data continue to grow, so does the shortage of big data professionals needed to fill those positions. The earlier sections of this chapter introduced and explained the characteristics needed to be successful in the field of big data. Characteristics such as communication, knowledge of big data concepts, and agility are just as important as the technical skills. As for R, it is a free and open-source tool, which makes it possible for anybody to have access to world-class statistical analysis tools.
Learning R is not easy; if it were, data scientists would not be in such high demand. However, there is no shortage of quality resources you can use to learn R if you are willing to put in the time and effort.

1.14 Online Resources

¹ http://cran.r-project.org/
The Comprehensive R Archive Network.
² http://cran.r-project.org/doc/manuals/r-release/R-intro.html
The R manual: An Introduction to R.
³ http://cran.r-project.org/doc/manuals/r-release/R-data.html
A guide to importing and exporting data to and from R.
⁴ http://cran.r-project.org/doc/manuals/r-release/R-exts.html
A guide to extending R, describing the process of creating R add-on packages, writing R documentation, R's system and foreign language interfaces, and the R API.
⁵ http://cran.r-project.org/doc/manuals/r-release/R-admin.html
A guide to installation and administration for R.
⁶ http://cran.r-project.org/doc/manuals/r-release/R-ints.html
A guide to the internal structures of R and coding standards for the core team working on R itself.
⁷ http://cran.r-project.org/doc/manuals/r-release/R-lang.html
An introduction to the R language, explaining evaluation, parsing, object oriented programming, computing on the language, and so forth.

1.15 Exercises

• Explain Data Analytics and the sources of Big Data.
• How do you import data in R? Explain the steps for data cleansing.
• What are the data structures used in R?
• How do you quit an R session?
• What is the R Script Editor? List a few IDEs for using R.
• Compare the R and Python programming languages for predictive modeling.
• The iris dataset has different species of flowers, such as Setosa, Versicolor and Virginica, with their sepal length. We want to understand the distribution of sepal length across all the species of flowers. One way to do this is to visualize this relation through the graph shown below.
A) Which function cab be used to produce the graph shown below? i) xyplot() ii) stripplot() iii) barchart() iv) bwplot() Below is the given sample function. Consider it and answer the following question. f <- function(x) { g <- function(y) { y + z } z <- 4 x + g(x) } 55 Statistics with R Programming A) If we execute following commands (written below), what will be the output? i) ii) iii) iv) 12 7 4 16 56 CHAPTER 02 Introduction to R Programming LEARNINGOBJECTIVES In this chapter you will learn: About Variables and Data types used in R How Vectors and Matrices are defined in R About Control Statements and loops How to use cbind and rbind function 2.1 Introduction to R Programming R language is an open source program maintained by the R core-development team. It is a team of volunteer developers from across the globe. R language is massively used for performing statistical operations. It is available on the R-Project website which is hosted at https://www.r-project.org/ . R actually is a derived tool of S dialect that was created at AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. Forms of ‘R’ are accessible, at no expense, for 32-bit variants of Microsoft Windows for Linux, for UNIX and for Macintosh OS X. (There are more established variants of R that help 8.6 and 9.) It is accessible through the Comprehensive R Archive Network (CRAN). Statistics with R Programming The reference for John Chambers' 1998 Association for Computing Machinery Software grant expressed that S has everlastingly changed how individuals break down, imagine and control information." The R venture develops the thoughts and bits of knowledge that created the S dialect. Here are focuses that potential clients may note: R has broad and great designs capacities that are firmly connected with its diagnostic capacities. The R framework is growing quickly. New highlights and capacities seem at regular intervals. Basic figuring and examinations can be dealt with clearly. 
In the event that basic techniques demonstrate lacking, there can be plan of action to the enormous scope of further developed capacities that R offers. Adjustment of accessible capacities permits considerably more prominent adaptability. Since R is free, clients have no privilege to expect consideration, on the R-encourage list or somewhere else, to questions. Be thankful for whatever assistance is given. Clients who need a point and snap interface ought to examine the R Commander (Rcmdr bundle) interface. While R is as dependable as any measurable programming that is accessible, and presented to higher guidelines of examination than most different frameworks, there are traps that call for unique consideration. A portion of the model fitting schedules are driving edge, with a constrained custom of experience of the impediments and entanglements. Whatever the factual framework, and particularly when there is some component of entanglement, check each progression with consideration. The abilities required for the registering are not without anyone else enough. Neither R nor some other measurable framework will give the measurable mastery expected to utilize advanced capacities, or to know when gullible strategies are lacking. Anybody with an opposite view may mind to think about whether a butcher's meat-cutting aptitudes are prone to be sufficient for successful creature (or perhaps human!) medical procedure. 58 Statistics with R Programming Involvement with the utilization of R is be that as it may, more than with most frameworks, prone to be an instructive affair. Hurrah for the R advancement group! Why learn R Programming R programming dialect is best for measurable, information investigation and machine learning. By utilizing this dialect we can make questions, capacities, and bundles. We can utilize it anyplace. It's stage autonomous, so we can apply it to all working framework. 
It is free, so anyone can install it in any organization without purchasing a license.

2.2 Variables in R

Variables are placeholders for values in a programming language. This section discusses variables and constants in R and the best practices for using a variable in a program.

2.2.1 What is Variable?

Variables are used to store data whose value can be changed according to our need. The unique name given to a variable (or to a function or object) is its identifier.

Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a digit.
3. Reserved words in R cannot be used as identifiers.

Valid Identifiers in R
total, Sum, .fine.with.dot, this_is_variable, Number4

Invalid Identifiers in R
tot@l, 5um, _not-a-variable, TRUE, .0ne

2.2.2 Assigning values to Variables

We assign values to variables with the assignment operator "=". Typing the variable name by itself at the prompt prints its value. Note that another form of assignment operator, "<-", is also in use.

> x = 1
> x
[1] 1

2.2.3 Good Practices

Earlier versions of R used underscore (_) as an assignment operator, so the period (.) was used extensively in variable names having multiple words. Current versions of R support underscore as a valid identifier character, but it is good practice to use period as the word separator. For example, a.variable.name is preferred over a_variable_name. Alternatively, we could use camel case, as in aVariableName.

2.2.4 Creating new Variables

Variables are created by assigning values to them with the assignment operator <-. R has no separate declaration step: a variable comes into existence when it is first assigned, so a placeholder such as NULL or NA can be assigned first and replaced with a real value later.
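The identifier rules and the two assignment operators described above can be tried out in a short session. This is only a minimal sketch: the variable names (unit.price, quantity, total.sales, fixed, result) are made up for illustration and do not come from the text. The base function make.names() is used here merely to show how R itself would repair names that break the rules.

```r
# Both assignment operators create (or overwrite) a variable.
unit.price = 4.5        # '=' form
quantity <- 3           # '<-' form, the more common one in R code

# A new variable built from existing ones.
total.sales <- unit.price * quantity
print(total.sales)

# make.names() turns arbitrary strings into syntactically valid names,
# which shows which of the candidates already satisfy the rules.
fixed <- make.names(c("total", "5um", "tot@l", ".fine.with.dot"))
print(fixed)

# A placeholder value (assumption: NULL is used as the placeholder here)
# can be assigned first and replaced later.
result <- NULL
result <- total.sales > 10
print(result)
```

Names that come back unchanged from make.names() were already valid identifiers; the others were altered to make them valid.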
# example of creating new variables
# (adata is assumed to be an existing data frame with numeric columns x1 and x2)
adata$sum <- adata$x1 + adata$x2
adata$mean <- (adata$x1 + adata$x2) / 2

# the same result using transform()
adata <- transform(adata,
                   sum = x1 + x2,
                   mean = (x1 + x2) / 2)

2.2.5 Can we Rename Variables

Yes, variables can be renamed either programmatically or interactively.

# rename interactively
fix(mydata)    # results are saved on close

# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname = "newname"))

2.3 Scalars in R

Data is the most basic ingredient of data analysis. R supports a wide variety of data types and objects, including scalars, vectors, matrices, data frames and lists. In this section we go over some commonly used data types and briefly cover objects at the end.

In programming, a scalar refers to an atomic quantity that can hold only one value at a time. Scalars are the most basic data types and can be used to construct more complex ones. Let us look at some common types of scalars with simple R commands.

Number

> x <- 2
> y <- 3.5
> class(x)
[1] "numeric"
> class(y)
[1] "numeric"
> class(y+x)
[1] "numeric"

Logical Value

> m <- x > y    # Is x larger than y?
> n <- x < y    # Is x smaller than y?
> m
[1] FALSE
> n
[1] TRUE
> class(m)
[1] "logical"
> class(NA)     # NA is another logical value: 'Not Available'/Missing Values
[1] "logical"

There are a few more logical operators you may want to try.

> m & n         # AND
[1] FALSE
> m | n         # OR
[1] TRUE
> !m            # Negation
[1] TRUE

Character (string)

> a <- "2"; b <- "3.5"    # Are they different from x and y we used earlier?
> a;b
[1] "2"
[1] "3.5"
> a+b                     # a+b=5.5?
Error in a + b : non-numeric argument to binary operator
> class(a)
[1] "character"
> class(as.numeric(a))      # but you can coerce this character into a number
[1] "numeric"
> class(as.character(x))    # vice versa
[1] "character"

2.4 Data Types in R

A data structure is a specific way of organizing and storing data. R supports five basic types of data structure, namely vector, matrix, list, data frame and factor. We will discuss these data structures and the way to write them in R. Let us see some of them:

2.4.1 Vectors

Vectors are the basic R type. A vector is a sequence of data elements of the same basic type. Members in a vector are officially called components. Nevertheless, we will just call them members here.

Here is a vector containing three numeric values 2, 3 and 5.

> c(2, 3, 5)
[1] 2 3 5

Here is a vector of logical values:

> c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1] TRUE FALSE TRUE FALSE FALSE

A vector can contain character strings:

> c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"

Incidentally, the number of members in a vector is given by the length function.

> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5

Combining Vectors

Vectors can be combined with the function c. For instance, the two vectors n and p below are combined into a new vector containing elements from both vectors. Because all members of a vector must have the same type, the numeric values are coerced to character strings.

> n = c(2, 3, 5)
> p = c("aa", "bb", "cc", "dd", "ee")
> c(n, p)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"

Arithmetic operations on vectors are performed member-by-member, i.e., memberwise. For instance, assume we have two vectors namely a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)

If we multiply a by 5, we get a vector with each of its members multiplied by 5.

> 5*a
[1] 5 15 25 35

And if we add a and b together, the sum is a vector whose members are the sums of the corresponding members of a and b.

> a+b
[1] 2 5 9 15

2.4.2 Matrices

A matrix is a two-dimensional (2D) data structure that can be created using the matrix() function in R. The numbers of rows and columns can be defined using the nrow and ncol arguments. Providing both is not required, as the other dimension is inferred from the length of the data. All columns in a matrix must have the same mode (numeric, character, etc.) and the same length. The general format is

mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
                   dimnames=list(char_vector_rownames, char_vector_colnames))

byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates that the matrix should be filled by columns (the default). dimnames provides optional labels for the rows and columns.

# generates 5 x 4 numeric matrix
y <- matrix(1:20, nrow=5, ncol=4)

# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
                   dimnames=list(rnames, cnames))

Constructing a Matrix in R

There are various ways to construct a matrix. When we construct a matrix directly from data elements, the matrix content is filled along the column orientation by default. For instance, in the following code snippet, the content of B is filled along the columns consecutively.
> B = matrix(
+   c(2, 4, 3, 1, 5, 7),
+   nrow=3,
+   ncol=2)
> B             # B has 3 rows and 2 columns
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7

Transpose

We construct the transpose of a matrix by interchanging its columns and rows with the function t.

> t(B)          # transpose of B
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

Merging of Matrices

The columns of two matrices having the same number of rows can be combined into a larger matrix. For instance, assume we have another matrix C also with 3 rows.

> C = matrix(
+   c(7, 4, 2),
+   nrow=3,
+   ncol=1)
> C             # C has 3 rows
     [,1]
[1,]    7
[2,]    4
[3,]    2

Then we can combine the columns of B and C with cbind.

> cbind(B, C)
     [,1] [,2] [,3]
[1,]    2    1    7
[2,]    4    5    4
[3,]    3    7    2

Similarly, we can combine the rows of two matrices if they have the same number of columns with the rbind function.

> D = matrix(
+   c(6, 2),
+   nrow=1,
+   ncol=2)
> D             # D has 2 columns
     [,1] [,2]
[1,]    6    2
> rbind(B, D)
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
[4,]    6    2

Deconstruction

A matrix can be deconstructed into a vector by applying the c function.

> c(B)
[1] 2 4 3 1 5 7

2.4.3 What is List in R

A list is a list-like structure that can hold data of different types. It is similar to a vector, but a vector contains data of a single type whereas a list can contain mixed data. A list is created using list(). A list is an ordered collection of objects (components) that allows you to gather a variety of (possibly unrelated) objects under one name.

# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)

# example of a list containing two lists
# (list1 and list2 are assumed to be existing lists)
v <- c(list1, list2)

A list is a generic vector containing other objects. For instance, the following variable x is a list containing copies of three vectors n, s, b, and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3)    # x contains copies of n, s, b

Slicing List in R

We retrieve a list slice with the single square bracket "[]" operator. The following is a slice containing the second member of x, which is a copy of s.

> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"

With an index vector, we can retrieve a slice with multiple members. Here is a slice containing the second and fourth members of x.

> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"

[[2]]
[1] 3

Reference of Members

In order to reference a list member directly, we have to use the double square bracket "[[]]" operator. The following object x[[2]] is the second member of x. In other words, x[[2]] is a copy of s, but is not a slice containing s or its copy.

> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"

We can modify its content directly.

> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee"    # s is unaffected

2.4.4 Data Frames

A data frame is a special case of a list in which every component has the same length. A data frame is created using the data.frame() function. For instance, consider the following:

> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
> str(x)    # structure of x
'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1

(The output above comes from a version of R earlier than 4.0.0. From R 4.0.0 onward, character columns are no longer converted to factors by default, so Name is shown as chr unless stringsAsFactors = TRUE is given.)

A data frame is used for storing data tables. It is a list of vectors of equal length. For instance, the following variable df is a data frame containing three vectors n, s, b.

> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)    # df is a data frame

Pre-defined Data Frame

We use built-in data frames in R for our tutorials. For instance, here is a built-in data frame in R, called mtcars.

> mtcars
                mpg cyl disp  hp drat   wt ...
Mazda RX4      21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag  21.0   6  160 110 3.90 2.88 ...
Datsun 710     22.8   4  108  93 3.85 2.32 ...

The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.

To retrieve data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator. The two coordinates are separated by a comma. In other words, the coordinates begin with the row position, followed by a comma, and end with the column position. The order is important.

Here is the cell value from the first row, second column of mtcars.

> mtcars[1, 2]
[1] 6

Moreover, we can use the row and column names instead of the numeric coordinates.

> mtcars["Mazda RX4", "cyl"]
[1] 6

Data Frame Column Vector

We reference a data frame column with the double square bracket "[[]]" operator. For instance, to retrieve the ninth column vector of the built-in data set mtcars, we write mtcars[[9]].

> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

We may also retrieve the exact same column vector by its name.

> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...

2.5 Arrays

An array in R is a multi-dimensional data structure. In an array, data is stored in the form of matrices, rows and columns. We can use the matrix level, row index and column index to access the array elements.

R Array

Arrays are the data objects which can store data in more than two dimensions. An array is created using the array() function. We can use vectors as input, and the values given in the dim parameter define the dimensions of the array. An understanding of vectors and vector functions is therefore needed to work with arrays.
For example: in the example below, we create an R array of two 3×3 matrices, each with 3 rows and 3 columns.

# create two vectors of different lengths.
vector1 <- c(2,9,3)
vector2 <- c(10,16,17,13,11,15)

These vectors are then passed to array(). Together they supply only 9 values, so they are recycled to fill both matrices.

# take these vectors as input to the array.
result <- array(c(vector1,vector2), dim = c(3,3,2))
print(result)

When the above code is executed, it prints:

, , 1

     [,1] [,2] [,3]
[1,]    2   10   13
[2,]    9   16   11
[3,]    3   17   15

, , 2

     [,1] [,2] [,3]
[1,]    2   10   13
[2,]    9   16   11
[3,]    3   17   15

Arrays are similar to matrices but can have more than two dimensions. Even more than five dimensions are possible, but such arrays are usually complex to handle.

2.6 Classes and Objects

R has three class systems. We will now cover these three systems, named S3, S4 and Reference Class.

We can do object-oriented programming in R. In fact, everything in R is an object. An object is a data structure having some attributes and methods which act on its attributes.

A class is a blueprint for the object. We can think of a class as a sketch (prototype) of a house. It contains all the details about the floors, doors, windows and so on. Based on these descriptions we build the house. The house is the object. As many houses can be made from one description, we can create many objects from a class. An object is also called an instance of a class, and the process of creating this object is called instantiation.

While most programming languages have a single class system, R has three class systems: S3, S4 and, more recently, Reference classes. They have their own features and peculiarities, and choosing one over the other is a matter of preference. Below, we give a brief introduction to them.

S3 Class

The S3 class system is somewhat primitive in nature.
It does not have a formal definition, and an object of this class can be created simply by adding a class attribute to it. This simplicity accounts for the fact that it is widely used in the R programming language. In fact, most of the R built-in classes are of this type.

> # create a list with required components
> s <- list(name = "John", age = 21, GPA = 3.5)
> # name the class appropriately
> class(s) <- "student"

The code snippet above creates an object of S3 class student from the given list.

S4 Class in R

The S4 class system is an improvement over S3. S4 classes have a formally defined structure, which helps in making objects of the same class look more or less similar. Class components are properly defined using the setClass() function and objects are created using the new() function.

Unlike S3 classes and objects, which lack a formal definition, an S4 class is stricter in the sense that it has a formal definition and a uniform way to create objects. This adds safety to our code and prevents us from accidentally making naive mistakes.

Snippet of S4 Class

> setClass("student", slots=list(name="character", age="numeric", GPA="numeric"))

Reference Class

Reference classes were introduced later than the other two. They are more like the object-oriented programming we are used to seeing in other major programming languages such as C++, Java and Python. Unlike S3 and S4 classes, methods belong to the class rather than to generic functions. Reference classes are internally implemented as S4 classes with an environment added to them.
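The three class systems described so far can be compared side by side in a small sketch. It reuses the student example from the text; the object names (s3.obj, s4.obj, rc.obj) and the class names student4 and studentRC are chosen here only so that the three definitions do not collide, and are not part of the original example.

```r
library(methods)

# S3: attach a class attribute to an ordinary list.
s3.obj <- list(name = "John", age = 21, GPA = 3.5)
class(s3.obj) <- "student"

# S4: formal definition with setClass(), objects created with new().
setClass("student4",
         slots = list(name = "character", age = "numeric", GPA = "numeric"))
s4.obj <- new("student4", name = "John", age = 21, GPA = 3.5)

# Reference class: setRefClass() returns a generator object.
StudentRC <- setRefClass("studentRC",
                         fields = list(name = "character", age = "numeric", GPA = "numeric"))
rc.obj <- StudentRC$new(name = "John", age = 21, GPA = 3.5)

# S3 components and reference class fields use $, S4 slots use @.
print(s3.obj$age)
print(s4.obj@age)
print(rc.obj$age)

# A second name for a reference object refers to the same object,
# so a change made through it is visible through the first name too.
rc.alias <- rc.obj
rc.alias$age <- 22
print(rc.obj$age)

# An S3 object is copied on modification, so the original is unchanged.
s3.copy <- s3.obj
s3.copy$age <- 22
print(s3.obj$age)
```

The last two steps illustrate the practical difference in behaviour: reference objects are modified in place, while S3 (and S4) objects behave like ordinary R values.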
> setRefClass("student")

Comparison of S3, S4 and Reference Classes

Let us see some differences between S3, S4 and Reference Classes.

S3 Class | S4 Class | Reference Class
Lacks formal definition | Class defined using setClass() | Class defined using setRefClass()
Objects are created by setting the class attribute | Objects are created using new() | Objects are created using generator functions
Attributes are accessed using $ | Attributes are accessed using @ | Attributes are accessed using $
Methods belong to generic function | Methods belong to generic function | Methods belong to the class
Follows copy-on-modify semantics | Follows copy-on-modify semantics | Does not follow copy-on-modify semantics

2.7 R Programming Structures

In this section we will learn how to write the simplest program, printing "Hello World", in the R programming language. In this program we will use the function print() to display the string. Normally the string is displayed with double quotes. To avoid that, pass quote=FALSE.

R is a block-structured language in the manner of the ALGOL-descendant family, such as C, C++, Python, Perl, and so on. As you have seen, blocks are delineated by braces, though braces are optional if the block consists of just a single statement. Statements are separated by newline characters or, optionally, by semicolons.

2.7.1 R Control Statements

These allow you to control the flow of execution of a script, typically inside a function. Common ones include:

if, else
for
while
repeat
break
return
next

These are not normally used when working with R interactively, but rather inside functions.

If

if (condition) {
  # do something
} else {
  # do something else
}

For

A for loop works on an iterator variable and assigns it successive values until the end of a sequence.
for (i in 1:11) {
  print(i)
}

x <- c("apples", "oranges", "bananas", "strawberries")

# iterate over the elements directly
for (fruit in x) {
  print(fruit)
}

# iterate over the index positions
for (i in 1:4) {
  print(x[i])
}

for (i in seq_along(x)) {
  print(x[i])
}

# braces are optional for a single statement
for (i in 1:4) print(x[i])

Nested For Loops

m <- matrix(1:10, 2)
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    print(m[i, j])
  }
}

While Loop

i <- 1
while (i < 5) {
  print(i)
  i <- i + 1
}

Make sure that there is a way to exit a while loop.

Break and Repeat

repeat {
  # simulations: generate some value and have an expectation;
  # value, expectation and threshold are placeholders to be set by the simulation.
  # If the value is within some range of the expectation, exit the loop.
  if ((value - expectation) <= threshold) {
    break
  }
}

Next Statement

for (i in 1:20) {
  if (i %% 2 == 1) {
    next
  } else {
    print(i)
  }
}

This loop prints only even numbers and skips odd numbers. Later we will learn other functions that allow us to avoid these kinds of slow control flows as much as possible (mostly the while and for loops).

Return Statement

Often we need a function to do some processing and hand the result back to the caller. This is accomplished with the return() statement in R.

Syntax:

return(expression)

R return() Statement Example:

check <- function(x) {
  if (x > 0) {
    result <- "Positive"
  } else if (x < 0) {
    result <- "Negative"
  } else {
    result <- "Zero"
  }
  return(result)
}

Some sample runs are given below.

> check(1)
[1] "Positive"
> check(-10)
[1] "Negative"
> check(0)
[1] "Zero"

2.8 R Arithmetic Operators

In R, as in most programming languages, you can perform basic math operations. The R arithmetic operators are Addition, Subtraction, Multiplication, Division, Exponent, Modulus and Integer Division. All of these are binary operators, which means they work on two operands.
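As a quick check of the binary arithmetic operators named above, the following sketch applies each of them to one pair of operands. The values 16 and 5, and the element names in the results vector, are arbitrary choices made for illustration.

```r
x <- 16
y <- 5

# Apply every arithmetic operator to the same pair of operands.
results <- c(
  addition         = x + y,
  subtraction      = x - y,
  multiplication   = x * y,
  division         = x / y,
  exponent         = x ^ 2,
  modulus          = x %% y,
  integer.division = x %/% y
)
print(results)

# Integer division and modulus fit together:
# quotient * divisor + remainder gives back the dividend.
print((x %/% y) * y + x %% y == x)
```

The modulus and integer division operators are the two that differ most from other languages in spelling, so they are worth trying with a few more pairs of values.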
The table below shows all the arithmetic operators in the R programming language.

Arithmetic Operations:

Operator   Description
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponent
%%         Modulus
%/%        Integer Division

Below is a sample of arithmetic operations:

> x <- 5
> y <- 16
> x+y
[1] 21
> x-y
[1] -11
> x*y
[1] 80

Logical Operators in R

All logical operations in R are performed using the operators below.

Operator   Description
!          Logical NOT
&          Element-wise logical AND
&&         Logical AND
|          Element-wise logical OR
||         Logical OR

The operators & and | perform element-wise operations, producing a result with the length of the longer operand. The operators && and || examine only the first element of each operand and return a logical vector of length one. (This is the behaviour of older versions of R; from R 4.3.0 onward, calling && or || with an operand of length greater than one is an error.) Zero is treated as FALSE and non-zero numbers are treated as TRUE.

A sample code:

> x <- c(TRUE,FALSE,0,6)
> y <- c(FALSE,TRUE,FALSE,TRUE)
> !x
[1] FALSE TRUE TRUE FALSE
> x&y
[1] FALSE FALSE FALSE TRUE
> x&&y
[1] FALSE
> x|y
[1] TRUE TRUE FALSE TRUE
> x||y
[1] TRUE

2.8.1 Values

For the default method of cbind and rbind, the value is a matrix combining the ... arguments column-wise or row-wise. (Exception: if there are no inputs or all the inputs are NULL, the value is NULL.)

The type of a matrix result is determined from the highest type of any of the inputs in the hierarchy raw < logical < integer < double < complex < character < list.

For cbind (rbind) the column (row) names are taken from the colnames (rownames) of the arguments if these are matrix-like. Otherwise they are taken from the names of the arguments or, where those are not supplied and deparse.level > 0, by deparsing the expressions given, for deparse.level = 1 only if that gives a sensible name (a 'symbol', see is.symbol).
For cbind, row names are taken from the first argument with appropriate names: rownames for a matrix, or names for a vector of length the number of rows of the result.

For rbind, column names are taken from the first argument with appropriate names: colnames for a matrix, or names for a vector of length the number of columns of the result.

2.9 Functions in R

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:

1. You can give a function a memorable or descriptive name that makes your code easier to understand.
2. As requirements change, you only need to update the code in one place, instead of many.
3. You reduce the chance of making incidental errors when you copy and paste (i.e. updating a variable name in one place, but not in another).

Writing good functions is a lifetime journey. Even after using R for many years, I still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.

As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good code style is like correct punctuation. You can manage without it, but it certainly makes things easier to read. As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.

Function Definition in R

An R function is created by using the keyword function.
The basic syntax of an R function definition is as follows:

function_name <- function(arg_1, arg_2, ...) {
  Function body
}

Components of Function in R

An R function has the following components:

1. Name of the Function: this is the actual name of the function. It is stored in the R environment as an object with this name.
2. Arguments: an argument or parameter is a placeholder for a value passed to the function when it is called. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Arguments can also have default values.
3. Function Body: the function body contains a collection of statements that defines what the function does.
4. Return Value: the return value of a function is the last expression in the function body to be evaluated.

R has many built-in functions which can be called directly in a program without defining them first. We can also create and use our own functions, referred to as user-defined functions.

There are some General Functions in R

builtins()      # List all built-in functions
options()       # Set options to control how R computes & displays results
abs(x)          # The absolute value of "x"
append()        # Add elements to a vector
c(x)            # A generic function which combines its arguments
cat(x)          # Prints the arguments
cbind()         # Combine vectors by row/column (cf. "paste" in Unix)
diff(x)         # Returns suitably lagged and iterated differences
gl()            # Generate factors with the pattern of their levels
grep()          # Pattern matching
identical()     # Test if 2 objects are *exactly* equal
jitter()        # Add a small amount of noise to a numeric vector
julian()        # Return Julian date
length(x)       # Return no. of elements in vector x
ls()            # List objects in current environment
mat.or.vec()    # Create a matrix or vector
paste(x)        # Concatenate vectors after converting to character
range(x)        # Returns the minimum and maximum of x
rep(1,5)        # Repeat the number 1 five times
rev(x)          # List the elements of "x" in reverse order

Then there are some Statistical Functions in R

help(package=stats)       # List all stats functions
?Chisquare                # Help on chi-squared distribution functions
?Poisson                  # Help on Poisson distribution functions
help(package=survival)    # Survival analysis
cor.test()                # Perform correlation test
cumsum(); cumprod(); cummin(); cummax()    # Cumulative functions for vectors
density(x)                # Compute kernel density estimates
ks.test()                 # Performs one or two sample Kolmogorov-Smirnov tests
loess(), lowess()         # Scatter plot smoothing
mad()                     # Calculate median absolute deviation
mean(x), weighted.mean(x), median(x), min(x), max(x), quantile(x)
rnorm(), runif()          # Generate random data with Gaussian/uniform distribution
splinefun()               # Perform spline interpolation
smooth.spline()           # Fits a cubic smoothing spline
sd()                      # Calculate standard deviation
summary(x)                # Returns a summary of x: mean, min, max etc.
t.test()                  # Student's t-test
var()                     # Calculate variance
sample()                  # Random samples & permutations
ecdf()                    # Empirical Cumulative Distribution Function
qqplot()                  # quantile-quantile plot

2.9.1 Does R have Pointers?

No. Just like Java, R has no pointers. R does not have variables corresponding to pointers or references like those of, say, the C language. This can make programming more difficult in some cases. (As of this writing, the current version of R has an experimental feature called reference classes, which may reduce the difficulty.) For instance, you cannot write a function that directly changes its arguments.
In Python, for example, you can do this:

>>> x = [13,5,12]
>>> x.sort()
>>> x
[5, 12, 13]

Here, the value of x, the argument to sort(), changed. By contrast, here is how it works in R:

> x <- c(13,5,12)
> sort(x)
[1] 5 12 13
> x
[1] 13 5 12

R is a statistical analysis package based on writing short scripts or programs (as opposed to being based on GUIs like spreadsheets or directed workflow editors). R is not a "conventional" programming language. In each script, variables provide the means of accessing the data stored in memory. R does not provide direct access to the computer's memory but instead provides a number of specialized data structures (objects).

2.9.2 Recursion

R Recursion Function

A function that calls itself is known as a recursive function, and this technique is known as recursion. This programming technique can be used to solve problems by breaking them into smaller and simpler sub-problems. An example can help clarify this concept.

Let us take the example of finding the factorial of a number. The factorial of a positive integer is defined as the product of all the integers from 1 to that number. For instance, the factorial of 5 (denoted as 5!) is

5! = 1*2*3*4*5 = 120

This problem of finding the factorial of 5 can be broken down into the sub-problem of multiplying the factorial of 4 by 5.

5! = 5*4!

Or more generally,

n! = n*(n-1)!

We can continue this until we reach 0!, which is 1. The implementation is given below.

# Recursive function to find factorial
recursive.factorial <- function(x) {
  if (x == 0) return (1)
  else return (x * recursive.factorial(x-1))
}

Here, we have a function which calls itself. A call such as recursive.factorial(x) turns into x * recursive.factorial(x-1), and this continues until x becomes equal to 0.
When x becomes 0, we return 1, since the factorial of 0 is 1. This is the terminating condition and it is very important. Without it, the recursion would not end and would continue indefinitely (in theory). Here are some example calls to our function.

> recursive.factorial(0)
[1] 1
> recursive.factorial(5)
[1] 120
> recursive.factorial(7)
[1] 5040

The use of recursion often makes code shorter and cleaner. However, it is sometimes hard to follow the code logic, and it may be hard to think about a problem recursively. Recursive functions are also memory intensive, since they can result in a lot of nested function calls. This must be kept in mind when using recursion to solve big problems.

2.10 Use of c and cbind in R

R has some built-in functions for combining objects, namely c, cbind, and rbind. Let us take a closer look at those.

c Function

Combine Values into a Vector or List

This is a generic function which combines its arguments. The default method combines its arguments to form a vector. All arguments are coerced to a common type, which is the type of the returned value, and all attributes except names are removed.

Details (cbind and rbind)

The functions cbind and rbind are S3 generic, with methods for data frames.
The data frame method will be used if at least one argument is a data frame and the rest are vectors or matrices. There can be other methods; in particular, there is one for time series objects. See the section on 'Dispatch' for how the method to be used is selected. If some of the arguments are of an S4 class, i.e., isS4(.) is true, S4 methods are sought as well, and the hidden cbind/rbind functions from package methods may be called, which in turn build on cbind2 or rbind2, respectively. In that case, deparse.level is obeyed, similarly to the default method.

cbind: Combine R Objects by Rows or Columns

Description

Take a sequence of vector, matrix or data frame arguments and combine them by columns or rows, respectively. These are generic functions with methods for other R classes.

Usage

    cbind(..., deparse.level = 1)
    rbind(..., deparse.level = 1)
    ## S3 method for class 'data.frame'
    rbind(..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors())

Arguments

    ...: (generalized) vectors or matrices. These can be given as named arguments. Other R objects may be coerced as appropriate, or S4 methods may be used: see sections 'Details' and 'Value'. (For the "data.frame" method of cbind these can be further arguments to data.frame, such as stringsAsFactors.)

    deparse.level: integer controlling the construction of labels in the case of non-matrix-like arguments (for the default method): deparse.level = 0 constructs no labels; the default, deparse.level = 1 or 2 constructs labels from the argument names, see the 'Value' section below.

    make.row.names: (only for the data frame method:) logical indicating if unique and valid row.names should be constructed from the arguments.

    stringsAsFactors: logical, passed to as.data.frame; only has an effect when the ...
arguments contain a (non-data.frame) character.

2.10.1 rbind Function

The rbind() function merges vectors, matrices or data frames by rows.

    rbind(x1, x2, ...)
    x1, x2: vectors, matrices, data frames

dataset1.csv

    Subtype Gender Expression
    A       m      -0.54
    A       f      -0.8
    B       f      -1.03
    C       m      -0.41
    D       m      3.22
    D       f      1.02
    D       f      0.21
    D       m      -0.04

dataset2.csv

    Subtype Gender Expression
    D       m      2.11
    B       m      -1.21
    A       f      -0.2

Read in the data from the files:

    > x <- read.csv("dataset1.csv",header=T,sep=",")
    > x2 <- read.csv("dataset2.csv",header=T,sep=",")
    > x3 <- rbind(x,x2)
    > x3
       Subtype Gender Expression
    1        A      m      -0.54
    2        A      f      -0.80
    3        B      f      -1.03
    4        C      m      -0.41
    5        D      m       3.22
    6        D      f       1.02
    7        D      f       0.21
    8        D      m      -0.04
    9        D      m       2.11
    10       B      m      -1.21
    11       A      f      -0.20

The columns of the two datasets must be the same; otherwise the combination will be meaningless.

2.10.2 R attach() and detach() Functions

The database is attached to the R search path. This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names. The attach() function makes the data available on the R search path.

    attach(x)
    x: data frame, matrix, list

The file below has been used for ANOVA analysis:

    Subtype,Gender,Expression
    A,m,-0.54
    A,m,-0.8
    A,m,-1.03
    A,m,-0.41
    A,m,-1.31
    A,f,-0.66
    A,m,-0.43
    A,m,1.01
    A,f,-1.15

Let us first read in the data from the file:

    > x <- read.csv("anova.csv",header=T,sep=",")

There are three header variables, namely Expression, Gender and Subtype.
We can display a variable by:

    > x$Gender
    [1] m m m m m f m m f m m f m m m m f m m m m m m f m m m f m m m m f m m m m
    [38] m m m m m m m m m f m f m m m m m f m m f m m f m m m m f m m m m m m m m
    [75] m m f m m m m m f m m m m m m m m m f m m f m m f m f m m f m m f m m f m
    [112] m f m m f m m m f m m m f m f m f f f f f f m f m f f f m f f f f m f m f
    [149] m f f m f f f f f m f m f f m f f m f f m f f f m f f f m f f f m f f m f
    [186] f f m f f m f m m f m f m f f m f f f f f m f f m f f f m m m f m m m f f
    [223] f f f f f m m m f m f f m f f f m f f f m f f f f m f m f f f f m f f f m
    [260] f f m f f f f f f m f f m f f f f f f m f f
    Levels: f m

The variable Gender is not yet on the R search path, so it cannot be referred to by name alone:

    > Gender
    Error: object 'Gender' not found

After we apply the attach() function to the object "x", "Gender" can be used directly by name:

    > attach(x)
    > Gender
    [1] m m m m m f m m f m m f m m m m f m m m m m m f m m m f m m m m f m m m m
    [38] m m m m m m m m m f m f m m m m m f m m f m m f m m m m f m m m m m m m m
    [75] m m f m m m m m f m m m m m m m m m f m m f m m f m f m m f m m f m m f m
    [112] m f m m f m m m f m m m f m f m f f f f f f m f m f f f m f f f f m f m f
    [149] m f f m f f f f f m f m f f m f f m f f m f f f m f f f m f f f m f f m f
    [186] f f m f f m f m m f m f m f f m f f f f f m f f m f f f m m m f m m m f f
    [223] f f f f f m m m f m f f m f f f m f f f m f f f f m f m f f f f m f f f m
    [260] f f m f f f f f f m f f m f f f f f f m f f
    Levels: f m

The detach() function reverses the process:

    > detach(x)
    > Gender
    Error: object 'Gender' not found

Factors in R

Tell R that a variable is nominal by declaring it as a factor. The factor stores the nominal values as a vector of integers in the range [1...k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers.
    # variable gender with 20 "male" entries and
    # 30 "female" entries
    gender <- c(rep("male",20), rep("female", 30))
    gender <- factor(gender)
    # stores gender as 20 2s and 30 1s and associates
    # 1=female, 2=male internally (alphabetically)
    # R now treats gender as a nominal variable
    summary(gender)

A factor variable in R is a vector of categorical data. The factor() function creates a factor variable and computes the categorical distribution of a vector of data.

    factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x))
    x: a vector of data
    ...

    > v <- c(1,3,5,8,2,1,3,5,3,5)
    > is.factor(v)
    [1] FALSE

It will calculate the categorical distribution:

    > factor(v)
    [1] 1 3 5 8 2 1 3 5 3 5
    Levels: 1 2 3 5 8
    > x <- factor(v)
    > x
    [1] 1 3 5 8 2 1 3 5 3 5
    Levels: 1 2 3 5 8
    > is.factor(x)
    [1] TRUE

Select levels:

    > x <- factor(v, levels=c(2,1))
    > x
    [1] 1 <NA> <NA> <NA> 2 1 <NA> <NA> <NA> <NA>
    Levels: 2 1

Change the level values:

    > levels(x) <- c("two","one")
    > x
    [1] one <NA> <NA> <NA> two one <NA> <NA> <NA> <NA>
    Levels: two one

Transforming Factors

Converting a factor to a number can cause problems:

    f <- factor(c(3.4, 1.2, 5))
    as.numeric(f)
    [1] 2 1 3

The result is the internal integer codes rather than the original values, which is usually not what is expected (and there is no warning). The recommended way is to use the integer vector to index the factor levels, i.e. as.numeric(levels(f))[f].

2.11 Chapter Summary

In this chapter, we have studied the introduction to R programming in detail. R is free and open-source, making it possible for anyone to access world-class statistical analysis tools. It is used widely in academia and the private sector and is one of the most popular statistical analysis programming languages today. Learning R isn't easy; if it were, data scientists wouldn't be in such high demand.
However, there is no shortage of quality resources you can use to learn R if you're willing to put in the time and effort. It is evident from the above that R is a popular and sound choice, as R supports several styles of programming. Additionally, R is open source, offers error-handling capabilities, and can interface with other languages.

2.12 Online Resources

1 https://data-flair.training/blogs/r-tutorial/
  R tutorial | Introduction to R Programming – Features and Applications
2 http://cran.r-project.org/doc/FAQ/R-FAQ.html
  The CRAN Web site hosts several documents, bibliographic resources, and links to other sites.
3 http://www.R-project.org/mail.html
  There are four discussion lists on R; use this page to subscribe, send a message, or read the archives.
4 https://www.rstudio.com/online-learning/
  A wealth of tutorials, articles, and examples exist to help you learn R and its extensions.

2.13 Exercises

Explain how you can start the R Commander GUI.

What is procedural programming in R?

What are statistical software and data analysis in R?

List the various steps involved in an analytics project in R.

Explain mean, median and mode in R. How do you perform mean, median and mode operations on statistical data in R?

What is the recycling of elements in a vector? Give an example.

Explain the following in terms of R: i) table ii) file iii) tree

What are vectors in R? Construct the following vectors in R:
a) $(0.1^{3}\,0.2^{1},\ 0.1^{6}\,0.2^{4},\ \ldots,\ 0.1^{36}\,0.2^{34})$
b) $(2,\ \frac{2^{2}}{2},\ \frac{2^{3}}{3},\ \ldots,\ \frac{2^{25}}{25})$

What are the different data structures in R? Briefly explain each of the following: a) Vector b) List c) Matrix d) Data frames e) Arrays

Consider the following matrix creation code. State whether it is True or False.

    > matrix(1:9, nrow = 3, ncol = 3)
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9
    > # same result is obtained by providing only one dimension
    > matrix(1:9, nrow = 3)
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9

Write code to access the elements of the matrix defined in the above code.

Match the following R functions with their appropriate uses.

    Function          Performs the following operation
    i) append()       a) Combine vectors by row/column
    ii) cbind()       b) set or query graphical parameters
    iii) example()    c) plot matrix of scatter plots
    iv) pairs()       d) add elements to a vector
    v) par()          e) gives a demo()

What are the different types of sorting algorithms available in the R language?

CHAPTER 03 Data Manipulation in R

LEARNING OBJECTIVES

3.1 String Manipulation

String manipulation comprises a set of functions used to extract information from text. In R we have some packages for string manipulation, such as stringr and stringi. Let us explore some functions in the stringr package.

strwrap: This function is used to wrap a string to format paragraphs. (strwrap() itself is part of base R; the stringr counterpart is str_wrap().)

    library(stringr)
    # takes a string in variable string
    string <- "Los Angeles, officially the City of Los Angeles and often known by its initials L.A., is the second-most populous city in the United States (after New York City), the most populous city in California and the county seat of Los Angeles County. Situated in Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity, sprawling metropolis, and as a major center of the American entertainment industry."
    strwrap(string)
    [1] "Los Angeles, officially the City of Los Angeles and often known by its initials"
    [2] "L.A., is the second-most populous city in the United States (after New York City),"
    [3] "the most populous city in California and the county seat of Los Angeles County."
    [4] "Situated in Southern California, Los Angeles is known for its Mediterranean"
    [5] "climate, ethnic diversity, sprawling metropolis, and as a major center of the"
    [6] "American entertainment industry."

str_length: It is used to count the number of characters in the string.

    str_length(string)
    [1] 429

str_replace: It is used to replace matched patterns in a string.

    str_replace(string, pattern, replacement)
    # string is the input vector
    # pattern specifies the pattern to look for
    # replacement specifies a character vector of replacements

str_sub: This function is used to extract substrings from a character vector.

    str_sub(string, start, end)
    # string is the input character vector
    # start gives the index of the first character to be extracted
    # end gives the index of the last character to be extracted

str_split: This function is used to split a vector of strings at a delimiter.

    str_split(string, pattern)
    # string is the input vector of strings
    # pattern specifies the delimiter

3.2 Data sorting

To sort a data frame in R, the order() function can be used. By default, sorting with order() is in ASCENDING order. Prepending a minus sign (-) to the sorting variable indicates sorting in DESCENDING order. Here are some examples.

    # sorting examples using the "temp" dataset
    attach(temp)
    # sort by id
    newdata <- temp[order(id),]
    # sort by id and name
    newdata <- temp[order(id, name),]
    # sort by id (ascending) and name (descending)
    # (the minus sign works for numeric columns only;
    # for a character column use -xtfrm(name) instead)
    newdata <- temp[order(id, -name),]
    detach(temp)

3.3 Dealing with missing values

We may come across situations where the data is incomplete. In that case, the missing data is represented as NA. To handle incomplete data we use functions such as na.omit() and complete.cases(). na.omit() returns the rows of a data frame that are free from missing values, while complete.cases() returns a logical vector indicating which rows are complete.
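As a minimal sketch of how the two functions relate, the following builds a small data frame with missing values and removes the incomplete rows in two equivalent ways. The data frame df and its columns are hypothetical and serve only as an illustration.

```r
# Hypothetical data frame with missing values in two different columns
df <- data.frame(id = 1:4,
                 score = c(10, NA, 30, 40),
                 grade = c("a", "b", NA, "d"))

complete <- complete.cases(df)   # logical vector, one entry per row
clean <- na.omit(df)             # data frame without the incomplete rows
same <- df[complete, ]           # the same rows, selected by the logical vector
```

Here rows 2 and 3 each contain an NA, so only rows 1 and 4 survive in both clean and same.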
    complete.cases(data)
    # returns a logical vector: TRUE for rows that are complete
    na.omit(data)
    # returns the rows free from missing information

To check whether individual values are missing, we use the function is.na(). It returns a logical value, i.e. TRUE or FALSE, for each element.

    is.na(x)
    # returns TRUE or FALSE for each element
    # here x is the object that we want to check

3.4 Find and remove duplicate records

R provides two base functions to find and remove duplicate records:

• duplicated(): To identify duplicate elements.
• unique(): To extract unique elements.

Example: Given the following vector:

    x <- c(2, 1, 5, 5, 4, 6)

To find the position of duplicate elements in x, use this:

    duplicated(x)
    ## [1] FALSE FALSE FALSE TRUE FALSE FALSE

Extract duplicate elements:

    x[duplicated(x)]
    ## [1] 5

3.5 Cleaning data

R provides many useful commands to clean data. Some of them are listed below:

● sub(): replaces the first match in each element of a character vector (for example, a data frame column).
● gsub(): replaces all matches in each element of a character vector.

2.13.1.1 Quantitative Variables in Ranges

● cut(data$col, seq(0,100, by=10)): Breaks the data up by the range it falls into.
● cut2(data$col, g=6): It returns a factor variable with 6 groups (cut2() is provided by the Hmisc package).
● cut2(data$col, m=25): It returns a factor variable with at least 25 observations in each group.

2.13.1.2 Manipulating Rows/Columns

● merge(): For combining data frames.
● sort(): It sorts a vector.
● order(data$col, na.last=T): It returns the row indexes in sorted order.
● data[order(data$col, na.last=T),]: It reorders the entire data frame based on the given column.
● melt(): In the reshape2 package, it is used for reshaping data.
● rbind(): It adds more rows to a data frame.

3.6 Recoding data

In R, it is possible to re-code an entire vector or array at once. For example, let's create a vector that has missing values.

    A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
    A
    [1] 3 2 NA 5 3 7 NA NA 5 2 6

Some re-coding tasks are more complex, for example when you want to re-code a categorical variable. In such cases, you might want to re-code a vector or an array from character/string values to numeric values.

    gender <- c("MALE","FEMALE","FEMALE","UNKNOWN","MALE")
    gender
    [1] "MALE" "FEMALE" "FEMALE" "UNKNOWN" "MALE"

3.7 Merging Data Frames

2.14 Adding Columns

To merge two data frames horizontally, the merge() function is used in R. In most cases, we join two data frames by one or more common key variables.

    # merge two data frames by ID
    total <- merge(data_frame_A, data_frame_B, by="ID")
    # merge two data frames by ID and Country
    total <- merge(data_frame_A, data_frame_B, by=c("ID","Country"))

2.15 Adding Rows

To join two data frames vertically, the rbind() function is used in R. The two data frames must have the same variables, but they need not be in the same order.

    total <- rbind(data_frame_A, data_frame_B)

3.8 Slicing of Data

R has powerful indexing features for accessing object elements. The following code snippets demonstrate ways to keep or delete variables and observations and to take random samples from a dataset.

2.16 Selecting (Keeping) Variables

    # select variables v1, v2, v3
    vars <- c("v1", "v2", "v3")
    newdata <- data[vars]
    # another method
    vars <- paste("v", 1:3, sep="")
    newdata <- data[vars]
    # select 1st and 5th thru 10th variables
    newdata <- data[c(1,5:10)]

3.9 Renaming Columns and Rows

2.17 Renaming Columns

To rename a column in a data frame, the colnames() function is used.

    # changes column name "old.name" to "new.name"
    colnames(data)[colnames(data)=="old.name"] <- "new.name"

colnames() can also be used to remove column names.

    # we remove column names by making them NULL
    colnames(data) <- NULL

2.19 Renaming Rows

To rename a row in a data frame, the rownames() function is used.
    # changes row name "old.name" to "new.name"
    rownames(data)[rownames(data)=="old.name"] <- "new.name"

rownames() can also be used to remove row names.

    # we remove row names by making them NULL
    rownames(data) <- NULL

3.10 Adding and Replacing Columns to Data Frames

To add a new column to a data frame we make use of the $ operator:

    my.dataframe$new.col <- a.vector
    # my.dataframe is the name of the data frame
    # new.col is the name of the new column
    # a.vector is the input vector to the new column

Alternatively, we can also add a column in the following way:

    my.dataframe["new.col"] <- a.vector

To replace a column, follow these steps:

    my.dataframe$prev.col <- NULL
    my.dataframe$new.col <- a.vector

Or, if we don't want to change the column name:

    my.dataframe$prev.col <- a.vector

3.11 Apply functions

apply

apply() can be used to apply a function to a matrix. For example, let's create a sample dataset:

    data <- matrix(c(1:10, 21:30), nrow = 5, ncol = 4)
    data
         [,1] [,2] [,3] [,4]
    [1,]    1    6   21   26
    [2,]    2    7   22   27
    [3,]    3    8   23   28
    [4,]    4    9   24   29
    [5,]    5   10   25   30

Now the apply() function can be used to find the mean of each row as follows:

    apply(data, 1, mean)
    [1] 13.5 14.5 15.5 16.5 17.5

The second parameter is the dimension. Here, 1 signifies rows and 2 signifies columns. If you want both, c(1, 2) can be used.

lapply

lapply() is similar to apply(), but it takes a list or vector as input and always returns a list as output. Let's create a list:

    data <- list(x = 1:5, y = 6:10, z = 20:25)
    data
    $x
    [1] 1 2 3 4 5
    $y
    [1] 6 7 8 9 10
    $z
    [1] 20 21 22 23 24 25

Now, lapply() can be used to apply a function to each element in the list. For example:

    lapply(data, FUN=median)
    $x
    [1] 3
    $y
    [1] 8
    $z
    [1] 22.5

3.10 Doing Math and Simulation in R, Math Function

In R, more than just basic operators can be used. R comes with a huge set of mathematical functions.
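As a quick illustration, the following sketch calls a few of R's built-in mathematical functions on a small numeric vector. The variable names are hypothetical and chosen only for this example.

```r
v <- c(1, 4, 9)

roots <- sqrt(v)             # square root of every element: 1 2 3
logs <- log(v, base = 3)     # logarithm to base 3 of every element
combos <- choose(5, 2)       # ways to draw 2 elements out of 5
fact <- factorial(4)         # 4! = 24
```

Note that sqrt() and log() are given the whole vector v in one call and return one result per element.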
All these functions are vectorized, so you can use them on complete vectors.

    Function         What It Does
    abs(x)           Takes the absolute value of x
    log(x,base=y)    Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm
    exp(x)           Returns the exponential of x
    sqrt(x)          Returns the square root of x
    factorial(x)     Returns the factorial of x (x!)
    choose(x,y)      Returns the number of possible combinations when drawing y elements at a time from x possibilities

In R, the natural logarithm of the numbers from 1 to 3 can be calculated like this:

    > log(1:3)
    [1] 0.0000000 0.6931472 1.0986123

We can calculate the logarithm of these numbers with any base. The logarithm with base 6 is given below:

    > log(1:3,base=6)
    [1] 0.0000000 0.3868528 0.6131472

For numeric input, sqrt() only works with non-negative real numbers. NaN is produced when a negative number is given as input:

    sqrt(-1)
    ## Warning in sqrt(-1): NaNs produced
    ## [1] NaN

3.12 Functions For Statistical Distribution

Normal distribution

There are four functions that can be used to generate the values associated with the normal distribution. We can get a full list of them and their options using the help command:

    > help(Normal)

The first function is dnorm(). It returns the height of the probability density at each of the given points. If you only give the points, it assumes you want to use a mean of zero and a standard deviation of one. There are options to use different values for the mean and standard deviation, though:

    > dnorm(0)
    [1] 0.3989423
    > dnorm(0)*sqrt(2*pi)
    [1] 1
    > dnorm(0,mean=4)
    [1] 0.0001338302
    > dnorm(0, mean=4, sd=10)
    [1] 0.03682701
    > v <- c(0, 1, 2)
    > dnorm(v)
    [1] 0.39894228 0.24197072 0.05399097
    > x <- seq(-20, 20, by=.1)
    > y <- dnorm(x)
    > plot(x, y)
    > y <- dnorm(x, mean=2.5, sd=0.1)
    > plot(x,y)

The second function is pnorm(). Given a number or a list of numbers, it computes the probability that a normally distributed random number will be less than that number.
This function also goes by the rather ominous title of the "Cumulative Distribution Function." It accepts the same options as dnorm:

    > pnorm(0)
    [1] 0.5
    > pnorm(1)
    [1] 0.8413447
    > pnorm(0, mean=2)
    [1] 0.02275013
    > pnorm(0, mean=2, sd=3)
    [1] 0.2524925
    > v <- c(0, 1, 2)
    > pnorm(v)
    [1] 0.5000000 0.8413447 0.9772499
    > x <- seq(-20, 20, by=.1)
    > y <- pnorm(x)
    > plot(x, y)
    > y <- pnorm(x, mean=3, sd=4)
    > plot(x, y)

3.13 Sorting, Linear Algebra Operation on Vectors and Matrices

2.20 Matrix facilities

In the following examples, A and B are matrices and x and b are vectors.

    Operator or Function    Description
    A * B                   Element-wise multiplication
    A %*% B                 Matrix multiplication
    A %o% B                 Outer product, AB'
    crossprod(A,B)          A'B
    crossprod(A)            A'A
    t(A)                    Transpose
    diag(x)                 Creates a diagonal matrix with the elements of x on the principal diagonal
    diag(A)                 Returns a vector containing the elements of the principal diagonal
    diag(k)                 If k is a scalar, this creates a k x k identity matrix. Go figure.
    solve(A, b)             Returns vector x in the equation b = Ax (i.e., A^-1 b)
    solve(A)                Inverse of A, where A is a square matrix
    ginv(A)                 Moore-Penrose generalized inverse of A. ginv(A) requires loading the MASS package.
    y <- eigen(A)           y$val are the eigenvalues of A
                            y$vec are the eigenvectors of A
    y <- svd(A)             Singular value decomposition of A.
                            y$d = vector containing the singular values of A
                            y$u = matrix whose columns contain the left singular vectors of A
                            y$v = matrix whose columns contain the right singular vectors of A
    R <- chol(A)            Choleski factorization of A. Returns the upper triangular factor, such that R'R = A.
    y <- qr(A)              QR decomposition of A.
                            y$qr has an upper triangle that contains the decomposition and a lower triangle that contains information on the Q decomposition.
                            y$rank is the rank of A.
                            y$qraux is a vector which contains additional information on Q.
                            y$pivot contains information on the pivoting strategy used.
    cbind(A,B,...)          Combine matrices (vectors) horizontally. Returns a matrix.
    rbind(A,B,...)          Combine matrices (vectors) vertically. Returns a matrix.
    rowMeans(A)             Returns vector of row means.
    rowSums(A)              Returns vector of row sums.
    colMeans(A)             Returns vector of column means.
    colSums(A)              Returns vector of column sums.

CHAPTER 04 Data Import Techniques in R

LEARNING OBJECTIVES

4.1 Installing a Package

To install a package, the install.packages() function is used.

    install.packages("Package Name")

4.2 Activating a Package

Activating a package is very easy in R. There are two ways to activate a package. The first method is activating packages directly from the package menu. The second method is using the library() function to activate a package.

    library("Package Name")

To open the documentation of a particular package, use the help parameter of library():

    library(help = "Package Name")

4.3 Built-In datasets

To access the built-in datasets we use the data() function:

    data(package = "package.name")
    # package.name is the name of the package from which we want to import our dataset

4.4 Reading Files

There are many standards for storing data. Common formats are CSV, JSON, XML, YAML etc. Importing data in R is very simple. For Stata and Systat, the foreign package is used. Examples of importing data are provided below.

From A Comma Delimited Text File

To import data from a CSV file, the read.table and read.csv functions are used.

    # header=TRUE: the first row contains the variable names
    # row.names="id": use the variable id as row names
    mydata <- read.table("data.csv", header=TRUE, sep=",", row.names="id")

There is another way of reading a CSV file.

    mydata <- read.csv(file = "data.csv")

From An Excel File

To import data from an Excel file we use the read.xlsx function.
    library(xlsx)
    mydata <- read.xlsx(file, sheetIndex, header=TRUE)
    # file specifies the file path
    # sheetIndex specifies the index of the sheet to be read
    # header: if TRUE, the first row is used as column names

4.5 Writing Data

There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you will need to load the foreign package. For Excel, the xlsx package can be used.

2.21 To a Tab Delimited Text File

    write.table(mydata, "data.txt", sep="\t")

2.22 To an Excel Spreadsheet

    library(xlsx)
    write.xlsx(mydata, "data.xlsx")

4.6 Basic SQL queries in R

There are numerous ways to query data with R. This section shows you three of the most common ways:

1. Using DBI
2. Using dplyr syntax
3. Using R Notebooks

Several recent packages make it easier to use databases within R. The query examples below demonstrate the capabilities of these R packages.

● DBI: The DBI specification has gone through many recent improvements. When working with databases, one should always use packages that are DBI-compliant.
● dplyr & dbplyr: The dplyr package has a generalized SQL backend for talking to databases, and the new dbplyr package translates R code into database-specific variants. SQL variants are supported for the following databases: Microsoft SQL Server, Oracle, PostgreSQL, Apache Hive, Amazon Redshift, and Apache Impala.
● odbc: The odbc R package provides a way to connect to any database as long as you have an ODBC driver installed. The odbc R package is DBI-compliant, and is recommended for ODBC connections.

Example: Query bank data in an Oracle database

In this example, we will query bank data in an Oracle database. We connect to the database by using the DBI and odbc packages. This specific connection requires a database driver and a data source name (DSN) that have both been configured by the system administrator. Your connection might use another method.
    library(DBI)
    library(dplyr)
    library(dbplyr)
    library(odbc)
    con <- dbConnect(odbc::odbc(), "Oracle DB")

2.22.1 Query using DBI

You can query the data with DBI by using the dbGetQuery() function. Simply write SQL code into the R function as a quoted string.

    dbGetQuery(con,'
      select "month_idx", "year", "month",
      sum(case when "term_deposit" = \'yes\' then 1.0 else 0.0 end) as subscribe,
      count(*) as total
      from "bank"
      group by "month_idx", "year", "month"
    ')

4.7 Web Scraping

Create a Scraping Function

First, you will need to load all the libraries for this task.

    # General-purpose data wrangling
    library(tidyverse)
    # Parsing of HTML/XML files
    library(rvest)
    # String manipulation
    library(stringr)
    # Verbose regular expressions
    library(rebus)
    # Eases DateTime manipulation
    library(lubridate)

Extract the Information of One Page

We want to extract the review text, rating, name of the author and time of submission of all the reviews on a subpage.

[Figure: screenshot of the review web page]

For each of the data fields we write one extraction function using the tags.

    get_reviews <- function(html){
      html %>%
        # The relevant tag
        html_nodes('.review-body') %>%
        html_text() %>%
        # Trim additional white space
        str_trim() %>%
        # Convert the list into a vector
        unlist()
    }

    get_reviewer_names <- function(html){
      html %>%
        html_nodes('.user-review-name-link') %>%
        html_text() %>%
        str_trim() %>%
        unlist()
    }

The datetime information is a little trickier, as it is stored as an attribute. In general, you look for the broadest description and then try to cut out all redundant information. Because time information does not only appear in the reviews, you also have to extract the relevant status information and filter by the correct entry.
    get_review_dates <- function(html){
      status <- html %>%
        html_nodes('time') %>%
        # The status information is this time a tag attribute
        html_attrs() %>%
        # Extract the second element
        map(2) %>%
        unlist()

      dates <- html %>%
        html_nodes('time') %>%
        html_attrs() %>%
        map(1) %>%
        # Parse the string into a datetime object with lubridate
        ymd_hms() %>%
        unlist()

      # Combine the status and the date information
      # to filter one via the other
      return_dates <- tibble(status = status, dates = dates) %>%
        # Only these are actual reviews
        filter(status == 'ndate') %>%
        # Select and convert to vector
        pull(dates) %>%
        # Convert DateTimes to POSIX objects
        as.POSIXct(origin = '1970-01-01 00:00:00')

      # The lengths still occasionally do not line up.
      # You then arbitrarily crop the dates to fit.
      # This can cause data imperfections; however,
      # reviews on one page are generally close in time.
      length_reviews <- length(get_reviews(html))
      return_reviews <- if (length(return_dates) > length_reviews){
        return_dates[1:length_reviews]
      } else{
        return_dates
      }
      return_reviews
    }

The last function we need is the extractor of the ratings. We will use regular expressions for pattern matching.

    get_star_rating <- function(html){
      # The pattern you look for: the first digit after `count-`
      pattern = 'count-' %R% capture(DIGIT)

      ratings <- html %>%
        html_nodes('.star-rating') %>%
        html_attrs() %>%
        # Apply the pattern match to all attributes
        map(str_match, pattern = pattern) %>%
        # str_match[1] is the fully matched string, the
        # second entry is the part you extract with the
        # capture in your pattern
        map(2) %>%
        unlist()

      # Leave out the first instance, as it is not part of a review
      ratings[2:length(ratings)]
    }

After we have tested that the individual extractor functions work on a single URL, we will combine them to create a tibble, which is essentially a data frame, for the whole page.
    get_data_table <- function(html, company_name){
      # Extract the basic information from the HTML
      reviews <- get_reviews(html)
      reviewer_names <- get_reviewer_names(html)
      dates <- get_review_dates(html)
      ratings <- get_star_rating(html)

      # Combine into a tibble
      combined_data <- tibble(reviewer = reviewer_names,
                              date = dates,
                              rating = ratings,
                              review = reviews)

      # Tag the individual data with the company name
      combined_data %>%
        mutate(company = company_name) %>%
        select(company, reviewer, date, rating, review)
    }

We wrap this function in a command that extracts the HTML from the URL so that handling becomes more convenient.

    get_data_from_url <- function(url, company_name){
      html <- read_html(url)
      get_data_table(html, company_name)
    }

In the last step, we will apply this function to many URLs. To do this, we use the map() function. It applies the same function over the items of a list. Finally, we will write one convenient function that takes as input the URL of the landing page. It extracts all reviews, binding them into one tibble. The map function applies the get_data_from_url() function.
    scrape_write_table <- function(url, company_name){
      # Read first page
      first_page <- read_html(url)

      # Extract the number of pages that have to be queried.
      # Note: get_last_page() is not defined in this section; it is
      # assumed to return the number of the last review page.
      latest_page_number <- get_last_page(first_page)

      # Generate the target URLs
      list_of_pages <- str_c(url, '?page=', 1:latest_page_number)

      # Apply the extraction and bind the individual
      # results back into one table,
      # which is then written as a tsv file into the
      # working directory
      list_of_pages %>%
        # Apply to all URLs
        map(get_data_from_url, company_name) %>%
        # Combine the tibbles into one tibble
        bind_rows() %>%
        # Write a tab-separated file
        write_tsv(str_c(company_name,'.tsv'))
    }

We save the result to disk in a tab-separated file, instead of the more common comma-separated file (CSV), because the reviews may contain commas, which may confuse the parser.

APPENDIX A

Reserved words in R Programming

Reserved words are the set of words that have special meaning and function. These keywords cannot be used as identifiers. The following table lists the reserved words of R programming. The list of reserved words in R can be viewed by typing help(reserved) or ?reserved at the R command prompt.

    for          while        if             else
    repeat       function     in             next
    break        TRUE         FALSE          NULL
    Inf          NaN          NA             NA_integer_
    NA_real_     NA_complex_  NA_character_  ...
    switch

(Strictly speaking, switch is an ordinary function rather than a reserved word.)

APPENDIX B

Standard identifiers of R programming

Identifiers can be a combination of letters, digits, period (.) and underscore (_). An identifier must start with a letter or a period. If it starts with a period, the period cannot be followed by a digit. Reserved words in R cannot be used as identifiers.
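As a rough check of these rules, base R's make.names() returns its input unchanged only when the input is already a syntactically valid name. The helper is.valid.name below is hypothetical and built on that behaviour; it is a sketch, not an official validity test.

```r
# Hypothetical helper: TRUE when make.names() leaves the string untouched,
# i.e. when the string is already a syntactically valid identifier
is.valid.name <- function(s) make.names(s) == s

valid <- is.valid.name(c("total", ".fine.with.dot", "x1"))
invalid <- is.valid.name(c("2abc", ".2way", "for", "my name"))
```

The first call covers names that follow the rules; the second covers a leading digit, a period followed by a digit, a reserved word, and a name containing a space.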
Examples of valid identifiers:

total      Sum        .fine.with.dot    aVariableName    as.numeric
levels     mydata     kgoto             sum              mean
x1         x2         y1                y2               uif
eend       aand       sdo               iwhile           vector1
array1     list1      matrix1           klabel           tchar
smod       number1

APPENDIX C
Standard Procedures in R Programming

Procedure                       Purpose
help()                          Opens the help page
apropos()                       Displays all objects matching a topic
library(help=packageName)       Help on a specific package
example()                       Provides a sample demo
args()                          Arguments for a function
?Control                        Help on control flow statements (e.g. if, for, while)
?Extract                        Help on operators acting to extract or replace subsets of vectors
?Logic                          Help on logical operators
?regex                          Help on regular expressions used in R
?Syntax                         Help on R syntax and the precedence of operators
append()                        Adds elements to a vector
cbind()                         Combines vectors by column (rbind() combines by row)
grep()                          Searches for matches to a regular expression
identical()                     Tests if 2 objects are exactly equal
length()                        Number of elements in a vector
ls()                            Lists objects in the current environment
range(x)                        Minimum and maximum
rep(x,n)                        Repeats x, n times
rev(x)                          Elements of x in reverse order
seq(x,y,n)                      Sequence (x to y, spaced by n)
sort(x)                         Sorts the vector x
order(x)                        Lists the element positions of x in sorted order
tolower(), toupper()            Convert a string to lower/upper case letters
unique(x)                       Removes duplicate entries from a vector
round(x), signif(x), trunc(x)   Rounding functions
getwd()                         Returns the working directory
setwd()                         Sets the working directory

APPENDIX D
Standard Functions of R Programming

Function                  Purpose                                                       Parameter Type
sqrt()                    Gives the square root of a number                             Integer, real
sum()                     Adds numbers                                                  Integer, real
log(x), log10(), exp()    Give the natural logarithm, base-10 logarithm, exponential    Integer, real
cos(x)                    Gives the cosine of x (x in radians)                          Integer, real
sin(x)                    Gives the sine of x (x in radians)                            Integer, real
tan(x)                    Gives the tangent of x (x in radians)                         Integer, real
%%                        Remainder after division (modulus)                            Integer, real
%/%                       Integer division                                              Integer
%*%                       Matrix multiplication                                         Integer, real
union()                   Gives the union of two sets                                   Character
intersect()               Gives the intersection of two sets                            Character
setdiff()                 Gives the difference of two sets                              Character
eigen()                   Gives eigenvalues and eigenvectors                            Integer, real
deriv()                   Symbolic and algorithmic derivatives of simple expressions    Integer, real
integrate()               Adaptive quadrature over a finite or infinite interval        Integer, real
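As a brief illustration of several operators and functions listed in Appendix D, the following base R sketch exercises each on small, made-up inputs. All variable names and values are assumptions chosen for illustration. Note that R's trigonometric functions expect radians, so an angle in degrees is converted first.

```r
# Remainder and integer division
r <- 17 %% 5     # 2
q <- 17 %/% 5    # 3

# Trigonometric functions take radians: convert 60 degrees first
c60 <- cos(60 * pi / 180)   # approximately 0.5

# Matrix multiplication: multiplying by the identity returns the same values
A <- matrix(1:4, nrow = 2)
P <- A %*% diag(2)

# Set operations on character vectors
a <- c("x", "y", "z")
b <- c("y", "z", "w")
u <- union(a, b)       # "x" "y" "z" "w"
i <- intersect(a, b)   # "y" "z"
d <- setdiff(a, b)     # "x"

# Eigenvalues of a diagonal matrix are its diagonal entries, largest first
ev <- eigen(diag(c(2, 3)))$values   # 3 2

# deriv() with function.arg = TRUE returns a function whose result carries
# the derivative in a "gradient" attribute; here d/dx x^2 at x = 3
f <- deriv(~ x^2, "x", function.arg = TRUE)
res <- f(3)
slope <- attr(res, "gradient")[1, "x"]   # 6

# integrate() over a finite and over an infinite interval
area  <- integrate(function(x) x^2, lower = 0, upper = 1)$value   # about 1/3
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value        # about 1
```

The integrate() result is a list; the numerical estimate is in its value element, and the abs.error element gives an estimate of the approximation error.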