Uploaded by Noorul Amin

Table of Contents
1.1 Introduction
1.2 Data Science with R
    How to work with R
1.3 R Session and Functions
1.4 Business Analytics, Data and Information
1.5 Basic Math in R
    1.5.1 Variables in R
1.6 Advanced Data Structures in R
1.7 Understanding Business Analytics with R
1.8 Comparison of R with other Software for Analytics
1.9 Installation of R
1.10 Using R Command Line Interface
1.11 Exploring and Learning RStudio
    1.11.1 Data Management in RStudio
    1.11.2 Importing Data in RStudio
    1.11.3 Exporting, Viewing and Removing Data
1.12 Using Help Feature in R
1.13 Chapter Summary
1.14 Online Resources
1.15 Exercise
2.1 Introduction to R Programming
2.2 Variables in R
    2.2.1 What is Variable?
    2.2.2 Assigning values to Variables
    2.2.3 Good Practices
    2.2.4 Creating new Variables
    2.2.5 Can we Rename Variables
2.3 Scalars in R
2.4 Data Types in R
    2.4.1 Vectors
    2.4.2 Matrices
    2.4.3 What is List in R
2.5 Data Frames
2.6 Arrays
2.7 Classes and Objects
2.8 R Programming Structures
    2.8.1 R Control Statements
2.9 R Arithmetic Operators
    2.9.1 Values
2.10 Functions in R
    2.10.1 Does R have Pointers?
    2.10.2 Recursion
2.11 Use of c and cbind in R
    2.11.1 Rbind/ rbind Function
    2.11.2 R attach() and detach() Functions
2.12 Chapter Summary
2.13 Online Resources
2.14 Exercises
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
Bibliography
CHAPTER 01
Introduction to Data Analytics
LEARNING OBJECTIVES
In this chapter, you will learn:
• How to install and configure R and write your first program in R
• A brief introduction to data analytics
• The IDE named RStudio
• How R functions are defined and how they work
1.1 Introduction
Nowadays the amount of data generated by a wide range of technologies, such as social
networking sites like Instagram, Facebook and Twitter or e-commerce sites, is so large
that it is difficult to store using traditional data storage facilities. Data is created
constantly, and at an ever-increasing rate. Mobile phones, social media and the imaging
technologies used for medical diagnosis all produce new data, which must be stored safely
somewhere and be retrievable whenever it is needed. Devices and sensors automatically
generate diagnostic information that must be stored and processed in real time. Merely
keeping up with this large inflow of data is difficult, but analyzing immense amounts of
it to spot meaningful patterns and extract useful information is substantially harder,
particularly when the data does not conform to traditional notions of data structure.
Statistics with R Programming
Several industries have led the way in developing their ability to collect and exploit
data:
Credit card companies monitor every purchase their customers make and can identify
fraudulent purchases with a high degree of accuracy using rules derived by processing
billions of transactions.
Mobile phone companies analyze subscribers' calling patterns to see, for example, whether
a caller's frequent contacts are on a rival network. If that rival network is offering an
attractive promotion that might cause the subscriber to defect, the mobile phone company
can proactively offer the subscriber an incentive to stay in her contract.
The valuations of these companies are heavily derived from the data they gather and host,
which contains more and more intrinsic value as the data grows.
 Sources of Big Data:
Big Data is generated today by numerous platforms and technologies, including:
• Stock Exchange
• Social Media Data
• Video Sharing Portals
• Transport Data
• Banking Data
 Stock Exchange:
Share market data, which covers the prices and status of the shares of thousands
of companies, is very large.
 Social Media Data:
The data of social networking sites contains information about all the account
holders, their posts, chat history, advertisements etc. On popular websites or
applications such as Facebook and Instagram, billions of users together produce
a huge amount of data.
 Video Sharing Portals:
Video sharing portals like YouTube, Vimeo etc. contain millions of videos, each
of which requires a lot of memory to store.
 Transport Data:
Transport data contains information about model, capacity, distance and
availability of different vehicles and their status.
 Banking Data:
The giants of the banking domain, like State Bank of India or ICICI, hold large
amounts of data about the transactions of their account holders.
1.2 Data Science with R
Data Science is the study of data: not just handling it, but examining every aspect of
it. It is a discipline that allows you to transform complex raw data into understandable
and useful insight. The motivation behind R programming for Data Science is to help users
understand the most important tools of R, which let them perform the transformations
needed to make use of their data.

Defining Data Science
Data Science is a field related to Big Data that seeks to extract meaningful
information from huge amounts of complex data. It is used to retrieve information
from data in various forms, whether structured or unstructured.
Data Science combines different fields of work and study in statistics and
computation in order to understand data for the purpose of decision making.
How to work with R
First of all, to work with data in R you have to import it, for example with the “Import”
button in the interface. Importing means taking data stored in a file, a database or a
web application programming interface (API) and loading it into a data frame in R.
Without the data imported into R, no further data science work can be done on it!
The next thing to do is to tidy your data. Yes, you heard right, tidy your data. Data is
often in a complex form, so it has to be made tidy, which means storing it in a
consistent form that matches the semantics of the dataset with the way it is stored. In
short, when your data is tidy, each column is a variable and each row is an observation.
This step is important because the consistent structure lets you focus on your questions
about the data instead of struggling to get the data into the right form for different
functions.
Once the data is tidy, the next step is to transform it. This includes narrowing in on
observations of interest (like all individuals in one town, or all data from the last
year), creating new variables that are functions of existing variables (like computing
speed from distance and time), and calculating a set of summary statistics (like counts
or means). Together, tidying and transforming are referred to as wrangling, because
getting your data into a form that is natural to work with often feels like a fight!
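The import and transform steps described above can be sketched in base R. This is only a minimal illustration under stated assumptions: the CSV contents, the column names (town, distance_km, time_h) and the values are invented for the example, and a temporary file stands in for a real data source.

```r
# Hypothetical data: write a small CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("town,distance_km,time_h",
             "Pune,120,2",
             "Pune,90,1.5",
             "Nagpur,200,4"), csv_path)

# Import: load the file into a data frame.
trips <- read.csv(csv_path)

# Transform 1: narrow in on observations of interest (one town).
pune <- trips[trips$town == "Pune", ]

# Transform 2: create a new variable from existing ones.
pune$speed_kmh <- pune$distance_km / pune$time_h

# Transform 3: compute summary statistics (a count and a mean).
n_trips    <- nrow(pune)
mean_speed <- mean(pune$speed_kmh)
print(n_trips)
print(mean_speed)
```

The data here is already tidy (one column per variable, one row per observation), which is why the transformations need no reshaping first.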
Visualisation is a fundamentally human activity. A good visualisation can show you
things that you did not expect, or raise new questions about the data. A good
visualisation may also hint that you are asking the wrong question, or that you need to
collect different data. Visualisations can surprise you, but they do not scale
particularly well because they require a human to interpret them.
Models are complementary tools to visualisations. Once you have made your questions
sufficiently precise, you can use a model to answer them. Models are fundamentally
mathematical or computational tools, so they generally scale well.
The final step of data science is communication, an absolutely critical part of any data
analysis project. It does not matter how well your models and visualisations have led
you to understand the data unless you can also communicate your results to others.
 How to run an R Program
This section guides you through writing and running code in R.
Before you run R code you need a few things to start with: R (the base language),
RStudio (an IDE for writing and executing R projects), a collection of R packages
called the tidyverse, and a few other packages.
Packages are the basic unit of reusable R code. They include reusable functions, the
documentation that describes how to use them, and sample data.
An example of R code, with its output shown after #>:
5+2
#> [1] 7
If you run the same code in a terminal or console, it will look like this:
> 5+2
[1] 7
1.3 R Session and Functions
Like some other languages in computer science, R is a functional language. Most
computation in R is handled using functions. The R language environment is designed to
facilitate the development of new scientific computation tools.
 Working with R Session
R programming can be done in two ways. One can either type commands line by line on the
console inside an "R session", or save the commands in a "script" file and execute the
whole file inside R. First let us talk about the R session.
To begin an R session, type 'R' on the command line (cmd in Windows, or a terminal in
Linux). For example, at a Linux terminal where the shell prompt is '$', type
$ R
This generates the output given below before the '>' prompt of R appears:
R version 3.2.1 (2015-07-10) -- "Stay With Me"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[Previously saved workspace restored]
>
Once we are inside the R session, we can directly execute R commands by typing them line
by line in the console window. Pressing the Enter key ends the command, runs it and
brings back the '>' prompt.
In the example session below, we declare two variables named 'a' and 'b', which are
assigned the values five and six respectively, and assign their sum to another variable
called 'c':
> a = 5
> b = 6
> c = a + b
> c
The value of c is shown as
[1] 11
In an R session, typing a variable name prints its value on the console.
 Saving in the R Session
Note that if the current session is not saved, all the commands, variables and
objects created in it are lost after exiting the R prompt.
When we work with R, the R objects we create and load are kept in a portion of
memory called the workspace. If we say 'no' to saving the workspace, all these
objects are wiped from the workspace memory. If we say 'yes', they are saved into
a file called ".RData", which is written to the current working directory.
In Linux, this working directory is usually the directory from which R was
started with the command 'R'. In Windows, it will be either "My Documents" or the
user's home directory.
When we start R in the same directory the next time, the workspace and all the
created objects are restored automatically from this ".RData" file.
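The save-and-restore cycle can also be carried out by hand. The sketch below uses save() and load() on a temporary file, an assumption made so that no real ".RData" file is overwritten; save.image() would write the whole workspace in the same format.

```r
# Temporary file used instead of the default ".RData" (assumption of this sketch).
workspace_file <- tempfile(fileext = ".RData")

a <- 5
b <- 6
save(a, b, file = workspace_file)      # write the two objects to disk

rm(a, b)                               # simulate leaving the session
gone <- !exists("a", inherits = FALSE) # TRUE: the object no longer exists here

load(workspace_file)                   # restore the saved objects
total <- a + b
print(total)
```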
 Exit the R Session
One can exit the R session by typing the quit() command at the R prompt and
answering 'n' (no) to the question about saving the workspace image. This simply
means that we do not want to keep what was created in the current session:
> quit()
Save workspace image? [y/n/c]: n
>
Functions in R
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way than
copy-and-pasting. Writing a function has three big advantages over using
copy-and-paste:
1. You can give a function a memorable, descriptive name that makes your code
easier to understand.
2. As requirements change, you only need to update the code in one place,
instead of many.
3. You reduce the chance of making incidental errors when you copy and paste
(e.g. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey. Even after using R for many years I
still learn new techniques and better ways of approaching old problems. The goal of
this chapter is not to teach you every esoteric detail of functions but to get you
started with some pragmatic advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also gives you some
suggestions for how to style your code. Good code style is like correct punctuation.
You can manage without it, but it certainly makes things easier to read. As with
styles of punctuation, there are many possible variations. Here we present the style
we use in our code, but the most important thing is to be consistent.
 Function Definition in R
An R function is created by using the keyword function. The basic syntax of
an R function definition is as follows:
function_name <- function(arg_1, arg_2, ...) {
  # function body
}
 Components of Function in R
An R function has the following components:
1. Function name: the actual name of the function. The function is stored in the R
environment as an object with this name.
2. Arguments: an argument or parameter is a placeholder for a value passed to the
function when it is called. When a function is invoked, you pass a value to the
argument. Arguments are optional; that is, a function may have no arguments.
Arguments can also have default values.
3. Function body: the function body contains a collection of statements that defines
what the function does.
4. Return value: the return value of a function is the last expression in the function
body to be evaluated.
R has many built-in functions which can be directly invoked in a program without
defining them first. We can also create and use our own functions, referred to as
user-defined functions.
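The four components can be seen together in a small user-defined function. The name rectangle_area and its arguments are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: area of a rectangle; width has a default value of 1.
rectangle_area <- function(length, width = 1) {
  area <- length * width   # function body
  area                     # the last evaluated expression is the return value
}

rectangle_area(4, 3)   # both arguments supplied
rectangle_area(4)      # width falls back to its default
```

The first call returns 12 and the second returns 4, because the missing width argument takes its default value.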
1.4 Business Analytics, Data and Information
Business analytics begins with a data set (a simple collection of data or a data file) or
commonly with a database (a collection of data files that contain information on people,
locations, and so on). As databases grow, they need to be stored somewhere. Technologies
such as computer clouds (hardware and software used for data remote storage, retrieval,
and computational functions) and data warehousing (a collection of databases used for
reporting and data analysis) store data. Database storage areas have become so large that a
new term was devised to describe them. Big data describes the collection of data sets that
are so large and complex that software systems are hardly able to process them.
 Definition of Business Analytics:
Business intelligence (BI) can be defined as a set of processes and technologies that
convert data into meaningful and useful information for business purposes.
 Business Analytics Examples
Business analytics techniques break down into two main areas. The first is basic
business intelligence. This involves examining historical data to get a sense of how a
department, team or staff member performed over a particular time. This is a mature
practice that most enterprises are fairly accomplished at using.
The second area of business analytics involves deeper statistical analysis. This may
mean doing predictive analytics by applying statistical algorithms to historical data to
make a prediction about the future performance of a product, service or website design
change. Or, it could mean using other advanced analytics techniques, like cluster
analysis, to group customers based on similarities across several data points. This can
be helpful in targeted marketing campaigns, for example.
Types of analytics include:
1. Descriptive analytics, which tracks key performance indicators to understand the
present state of a business.
2. Predictive analytics, which analyzes trend data to assess the likelihood of future
outcomes.
3. Prescriptive analytics, which uses past performance to generate recommendations
about how to handle similar situations in the future.
 Business Analytics vs. Data Science
The more advanced areas of business analytics can start to resemble data science, but
there is a distinction. Even when advanced statistical algorithms are applied to data
sets, it does not necessarily mean data science is involved. There are a host of
business analytics tools that can perform these kinds of functions automatically,
requiring few of the special skills involved in data science.
True data science involves more custom coding and more open-ended questions. Data
scientists generally do not set out to solve a specific question, as most business
analysts do. Rather, they explore data using advanced statistical methods and allow the
features in the data to guide their analysis.
1.5 Basic Math in R
In R you can perform basic math operations, as in most programming languages. The
operators are listed below.
 Arithmetic Operations:
Operator   Description
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponent
%%         Modulus
%/%        Integer Division
Below is sample of arithmetic operations performed:
> x <- 5
> y <- 16
> x+y
[1] 21
> x-y
[1] -11
> x*y
[1] 80
> y/x
[1] 3.2
> y%/%x
[1] 3
> y%%x
[1] 1
> y^x
[1] 1048576
 Relational Operators in R
Relational operators are used to compare values. Here is a list of the relational
operators available in R.
Operator   Description
<          Less Than
>          Greater Than
<=         Less Than or Equal To
>=         Greater Than or Equal To
==         Equal To
!=         Not Equal To
Below is sample of relational operations performed:
> x <- 5
> y <- 16
> x<y
[1] TRUE
> x>y
[1] FALSE
> x<=5
[1] TRUE
> y>=20
[1] FALSE
> y == 16
[1] TRUE
> x != 5
[1] FALSE
 Logical Operators in R
All logical operations in R are performed using the operators listed below.
Operator   Description
!          Logical NOT
||         Logical OR
|          Element-wise Logical OR
&&         Logical AND
&          Element-wise Logical AND
The operators ‘&’ and ‘|’ perform element-wise operations and produce a result with the
length of the longer operand.
The operators ‘&&’ and ‘||’ examine only the first element of each operand and produce a
logical vector of length one. This is the behaviour of older versions of R, such as the
3.2.1 release shown earlier; since R 4.3.0, calling ‘&&’ or ‘||’ with an operand longer
than one element is an error.
Zero is treated as FALSE and non-zero numbers are treated as TRUE. A sample session:
> x <- c(TRUE,FALSE,0,6)
> y <- c(FALSE,TRUE,FALSE,TRUE)
> !x
[1] FALSE  TRUE  TRUE FALSE
> x&y
[1] FALSE FALSE FALSE  TRUE
> x&&y
[1] FALSE
> x|y
[1]  TRUE  TRUE FALSE  TRUE
> x||y
[1] TRUE
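A pattern that works across R versions is to keep ‘&’ and ‘|’ for vectors and to reduce a vector to a single value with any() or all() before using ‘&&’ or ‘||’. The sketch below reuses the x and y from the sample session; the variable names both and either are introduced here for illustration.

```r
x <- c(TRUE, FALSE, 0, 6)          # coerced to numeric: 1 0 0 6
y <- c(FALSE, TRUE, FALSE, TRUE)

both   <- x & y                    # element-wise AND, length 4
either <- x | y                    # element-wise OR, length 4

# && and || are meant for single values, e.g. inside if().
# Reduce each vector to one value first with any() or all().
if (any(both) && !all(either)) {
  message("at least one pair is TRUE on both sides, but not every pair has a TRUE")
}
```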
 Assignment Operators in R
Next come the assignment operators, which are used for assigning values to
variables.
Operator     Description
<-, <<-, =   Leftwards Assignment
->, ->>      Rightwards Assignment
The operators <- and = are often used, nearly interchangeably, to assign to a variable in
the same environment.
The <<- operator is used for assigning to variables in parent environments (more like
global assignments). The rightward assignments, though available, are seldom used.
> x <- 5
> x
[1] 5
> x = 9
> x
[1] 9
> 10 -> x
> x
[1] 10
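The session above does not exercise <<-. One common use, sketched here with a hypothetical make_counter() function, is to update a variable that lives in the enclosing function's environment rather than creating a new local one.

```r
make_counter <- function() {
  count <- 0                 # lives in the environment of make_counter()
  function() {
    count <<- count + 1      # updates count in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()   # first call
counter()   # second call
```

The two calls return 1 and 2. With <- in place of <<-, each call would create a fresh local count and return 1 every time.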
1.5.1 Variables in R
Variables are like placeholders for values in a programming language. We will now
discuss variables and constants in R, along with best practices for using variables in
a program.
 A Variable
Variables are used to store data whose value can be changed as needed. The unique
name given to a variable (or to a function or object) is called an identifier.
 Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed
by a digit.
3. Reserved words in R cannot be used as identifiers.
 Valid Identifiers in R
total, Sum, .fine.with.dot, this_is_variable, Number4
 Invalid Identifiers in R
tot@l, 5um, _not-a-variable, TRUE, .0ne
 Good Practices
Earlier versions of R used the underscore (_) as an assignment operator, so the period
(.) was used extensively in variable names having multiple words.
Current versions of R support the underscore as a valid identifier character, but it is
good practice to use the period as a word separator.
For example, a.variable.name is preferred over a_variable_name. Alternatively, we could
use camel case, as in aVariableName.
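The identifier rules can be checked programmatically. A common approach, sketched here, relies on the base function make.names(), which rewrites any string that is not a syntactically valid name; a candidate that comes back unchanged is therefore valid. The candidates are the valid and invalid examples listed above.

```r
candidates <- c("total", "Sum", ".fine.with.dot", "this_is_variable", "Number4",
                "tot@l", "5um", "_not-a-variable", "TRUE", ".0ne")

# make.names() leaves valid names alone and alters invalid ones,
# so comparing input and output flags the invalid identifiers.
is_valid <- make.names(candidates) == candidates
print(data.frame(name = candidates, valid = is_valid))
```

The first five names are reported as valid and the last five as invalid, matching the two lists.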
 Constants in R
Constants are entities whose values cannot be altered once defined. There are two basic
types of constants in the R programming language:
• Numeric Constants
• Character Constants
 Numeric Constants
All numbers fall under this category. They can be of type integer, double or complex.
The type can be checked with the typeof() function.
Numeric constants followed by L are regarded as integer and those followed by i are
regarded as complex.
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"
Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.
> 0xff
[1] 255
> 0XF + 1
[1] 16
 Character Constants
Character constants can be written using either single quotes (') or double quotes (") as
delimiters.
> 'example'
[1] "example"
> typeof("5")
[1] "character"

Predefined Constants in R
Some of the built-in constants defined in R along with their values are shown below.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
[1] "January"   "February"  "March"     "April"     "May"       "June"
[7] "July"      "August"    "September" "October"   "November"  "December"
1.6 Advanced Data Structures in R
A data structure is a specific way of organizing and storing data. R supports five basic
types of data structure, namely vector, matrix, list, data frame and factor. We will
discuss these data structures and the way to write them in R.
R has a variety of data structures to dive into. Let us see some of them:
• Vector
• Matrices
• Data Frame
• List
• Factor
 Vector
A vector contains elements of the same type, i.e., integer, double, logical, complex,
etc. Vectors are useful and widely used. To create a vector in R, the c() function is
used; the colon operator creates a sequence of consecutive numbers.
For example,
> x <- 1:7; x
[1] 1 2 3 4 5 6 7
> y <- 2:-2; y
[1]  2  1  0 -1 -2
a <- c(1,2,5.3,6,-2,4) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
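Because a vector holds a single type, c() converts mixed inputs to the most general type among them. The short sketch below shows this coercion with typeof(); the variable name mixed is introduced here for illustration.

```r
typeof(c(1L, 2L))          # "integer"
typeof(c(1L, 2.5))         # "double": the integer is promoted
typeof(c(TRUE, 2L))        # "integer": TRUE becomes 1
typeof(c(1, "two", TRUE))  # "character": everything becomes a string

mixed <- c(1, "two", TRUE)
print(mixed)               # "1" "two" "TRUE"
```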
 Matrices
A matrix is a two-dimensional (2D) data structure that can be created using the matrix()
function in R. The numbers of rows and columns can be set using the nrow and ncol
arguments. Providing both is not required, as the other dimension is worked out
automatically from the length of the data. All columns in a matrix must have the same
mode (numeric, character, etc.) and the same length. The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates
that the matrix should be filled by columns (the default). dimnames provides optional
labels for the rows and columns.
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))
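The effect of byrow, and the fact that one dimension can be left out, can be checked with the same cells as above. The names by_row and by_col are introduced here only for the comparison.

```r
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

# ncol is omitted: it is worked out from length(cells) / nrow.
by_row <- matrix(cells, nrow = 2, byrow = TRUE,  dimnames = list(rnames, cnames))
by_col <- matrix(cells, nrow = 2, byrow = FALSE, dimnames = list(rnames, cnames))

dim(by_row)          # 2 2
by_row["R1", "C2"]   # 26: the second value is placed along the first row
by_col["R1", "C2"]   # 24: the third value starts the second column
```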
 Lists
A list is an ordered collection of objects (components) that can hold data of different
types. It is similar to a vector, but a vector contains data of a single type whereas a
list can contain mixed data. A list is created using list(). A list allows you to gather
a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
# (a is the numeric vector and y the matrix created in the earlier examples)
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of combining two existing lists, list1 and list2, into one list
v <- c(list1,list2)
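Components of a list are usually retrieved with $ or [[ ]], while single brackets return a smaller list. The sketch below is self-contained, so it builds its own small w, list1 and list2 with made-up values instead of relying on objects from the earlier examples.

```r
w <- list(name = "Fred", mynumbers = c(1, 2, 5.3), age = 5.3)

w$name              # "Fred": access a component by name
w[["mynumbers"]]    # the numeric vector itself
class(w[["age"]])   # "numeric": [[ ]] extracts the component
class(w["age"])     # "list": [ ] keeps the list wrapper

# c() joins two lists end to end; list() nests them instead.
list1 <- list(1, "a")
list2 <- list(TRUE)
length(c(list1, list2))      # 3 components
length(list(list1, list2))   # 2 components, each itself a list
```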

Data Frame
This data structure is a special case of list where each component is of
same length. Data frame is created using frame() function.
For example:
> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
> str(x) # structure of x
'data.frame': 2 obs. of 3 variables:
$ SN  : int 1 2
$ Age : num 21 15
$ Name: Factor w/ 2 levels "Dora","John": 2 1
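The output above shows Name stored as a factor, which was the default before R 4.0.0. From R 4.0.0 onward, data.frame() keeps strings as character vectors unless stringsAsFactors = TRUE is given. The sketch below sets the argument explicitly so that it behaves the same on either version.

```r
# Sketch: build the data frame with an explicit stringsAsFactors setting.
x <- data.frame(SN = 1:2, Age = c(21, 15), Name = c("John", "Dora"),
                stringsAsFactors = TRUE)

class(x$Name)     # "factor"
levels(x$Name)    # "Dora" "John"
x[x$Age > 18, ]   # rows where Age exceeds 18
nrow(x)           # 2
ncol(x)           # 3
```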
 Factors
Tell R that a variable is nominal by declaring it as a factor. A factor stores the
nominal values as a vector of integers in the range [1...k] (where k is the number of
unique values in the nominal variable), together with an internal vector of character
strings (the original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
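A small sketch for inspecting the integer codes behind the factor. The second factor, gender2, is an assumption added here to show how the level order can be set explicitly instead of alphabetically.

```r
# Sketch: levels, integer codes and an explicit level order.
gender <- factor(c(rep("male", 20), rep("female", 30)))

levels(gender)              # "female" "male" (alphabetical order)
table(as.integer(gender))   # code 1 (female) occurs 30 times, code 2 (male) 20 times
summary(gender)             # female 30, male 20

# Hypothetical variant: choose the level order yourself
gender2 <- factor(gender, levels = c("male", "female"))
as.integer(gender2)[1]      # 1, because "male" is now the first level
```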
1.7 Understanding Business Analytics with R
Today, it is imperative for every modern business to understand the large amounts of
data it holds about its customers and itself. Doing so with spreadsheets is futile, and
although SAS offers a solution, it is not the simplest one. R is an open source
programming language that has been widely used by scientists across the world. It is a
language that can also help businesses analyze large amounts of data easily and
effectively.
R makes it straightforward for a business to work through its entire data set. It does
this by scaling the data so that different processors can work on it in parallel. With a
regular R package, most computers do not have enough memory to handle large
amounts of data. However, R offers ScaleR, which breaks the data into smaller chunks so
that they can be processed on several servers at the same time. In other words, ScaleR
makes it easy to divide a very large data set across different nodes.
This allows users of the language to carry out statistical analyses in a very
sophisticated manner. Moreover, the language also makes it possible for programmers to
perform periodic checks on the data as it is being processed. This benefits businesses
because they can use large amounts of data and fine-tune it to perform more
sophisticated analyses.
To demonstrate the power of the R programming language, let us take a simple
example.
A business with thirty million rows of data on sixty different variables can be
analyzed in only ten minutes with the help of an R package. SAS cannot compete with
this, which is why so many companies and organizations are becoming fans of R.
Furthermore, R gives organizations a clear view of the kind of data they are dealing
with. The nice thing about using R is that organizations do not need to build
customized tools, nor do they have to write a great deal of code. R enables a business to
clean and examine the data it needs to analyze quickly and easily.
Another benefit of the language is that it lets a business explore new data with the
help of customized visualizations. R is very strong when it comes to visualizations and
graphics. It lets organizations easily create appealing graphics, an area in which SAS
is considerably weaker.
1.8 Comparison of R with other Software for Analytics
There has long been a debate about how R compares with SAS or Python for data
analytics.
SAS and R are the two tools most often set against each other in data science, so let us
first look at some comparative factors between them.
 SAS
SAS (Statistical Analysis System) is a software suite that can mine, alter, manage
and retrieve data from a variety of sources and perform statistical analysis on it.
SAS has become the undisputed market leader in the commercial analytics space. The
software offers a huge variety of statistical functions, has a good GUI (Enterprise
Guide and Miner) for people to learn quickly, and provides excellent technical
support. However, it ends up being the most expensive option and is not always
updated with the latest statistical functions.
 R
R is the open source counterpart of SAS and has traditionally been used by
researchers and academics. Because of its open source nature, the latest techniques
are released quickly.
A lot of documentation is available on the web, and it is a very cost-effective
option. Fundamentally, R is a programming language and free software environment
for statistical computing and graphics supported by the R Foundation for
Statistical Computing.
1. Availability/Cost:
SAS is commercial software. It is expensive and still out of reach for most
professionals (in an individual capacity). Nevertheless, it holds the largest
market share in private organizations. So, unless you are in an organization that
has invested in SAS, it may be difficult to get access to it.
R, on the other hand, is free and can be downloaded by anybody.
2. Ease of learning:
SAS is easy to learn and provides a simple option (PROC SQL) for people who
already know SQL. In addition, it offers a good, stable GUI. In terms of
resources, there are tutorials available on the websites of various universities,
and SAS has comprehensive documentation. Certifications are available from SAS
training institutes, but they come at a significant cost.
3. Advancements in the tool:
Both ecosystems provide all the basic and most commonly needed functions. This
factor only matters if you work with the latest technologies and algorithms.
Because of its open nature, R gets the latest features quickly. SAS, on the other
hand, updates its capabilities in new version rollouts. Development of new
techniques in R is fast, since R has long been used widely in academia.
SAS releases updates in a controlled environment, so they are well tested. R, on
the other hand, accepts open contributions, so there is a chance of errors in the
latest developments.
4. What is the argument?
It is really not that simple. Linux can do everything Windows does and more, yet
Windows still rules. Among the main reasons for the continued dominance of
Windows are an easier user experience and established momentum.
Despite all the advantages Linux offers (better security, no viruses, and a
comparable user experience, especially in the Ubuntu flavors), the average user
prefers Windows. That is not to say Linux lacks an enthusiastic support
community and a loyal following.
5. Statistical capability:
The various SAS products, including SAS/STAT, are powerful and cover essentially
the whole range of statistical techniques and analyses. However, since R is open
source and people can submit their own packages/libraries, the latest
cutting-edge techniques are always released in R first. To date, R has about
15,000 packages in the CRAN (Comprehensive R Archive Network, the site that
hosts R and its packages) repository.
Some of the more recent techniques, such as GLMNET, AdaBoost and RF (random
forests), are available in R but not in SAS. Many experimental packages are also
available in R. In fact, in most Kaggle competitions (which deserve a post of
their own), the winners (who are among the world's best data miners) have
almost always used R to build their models.
6. Customer support and community:
R has the largest online community but no dedicated customer support. If you run
into a problem, you are on your own, although you will get plenty of help from
the community.
SAS, on the other hand, has dedicated customer service in addition to its
community. So, if you run into installation problems or other technical
challenges, you can contact them.
 Conclusion
There is no clear winner in this race. It would be premature to place bets on
which tool will prevail, given the dynamic nature of the industry. Depending on
your circumstances (career stage, finances and so on), you can add your own
weights and work out what suits you. Here are a few typical scenarios:
If you are a newcomer entering the analytics industry, we suggest learning SAS as
your first language. It holds the largest share of the job market and is easy to
learn.
If you have already spent time in the industry, you should try to broaden your
expertise by learning a new tool.
Experts and professionals in the industry should know at least two of these tools.
That will add a lot of flexibility for the future and open up new opportunities.
If you work in a startup or in freelancing, R is more useful.
1.9 Installation of R
R (https://www.r-project.org) is a commonly used free statistics software package.
R allows you to carry out statistical analyses in an interactive mode, as well as
allowing simple programming.
To use R, you first need to install the R program on your computer.
How to check if R is installed on a Windows PC
Before you install R on your computer, the first thing to do is to check whether R is
already installed on your computer (for example, by a previous user).
These instructions focus on installing R on a Windows PC. However, we will also
briefly mention how to install R on a Macintosh or Linux computer (see below).
If you are using a Windows PC, there are two ways you can check whether R is
already installed on your computer:
1. Check whether there is an “R” icon on the desktop of the computer that you are
using. If so, double-click on the “R” icon to start R. If you cannot find an “R”
icon, try step 2 instead.
2. Click on the “Start” menu at the bottom left of your Windows desktop, and then
move your mouse over “All Programs” in the menu that pops up. See if “R”
appears in the list of programs that pops up. If it does, it means that R is already
installed on your computer, and you can start R by selecting “R” (or R X.X.X,
where X.X.X gives the version of R, for instance, R 2.10.0) from the list.
If either (1) or (2) above succeeds in starting R, R is already installed on the
computer that you are using. (If neither succeeds, R is not installed yet.) If an old
version of R is installed on the Windows PC that you are using, it is worth
installing the latest version of R, to make sure that you have all the latest R
functions available to you.
How to check the latest version of R
To find out what the latest version of R is, you can look at the CRAN
(Comprehensive R Archive Network) website, http://cran.r-project.org/.
Beside "The latest release" (about half way down the page), it will say something
like "R-X.X.X.tar.gz" (e.g. "R-2.12.1.tar.gz"). This means that the latest release of R is
X.X.X (for example, 2.12.1).
New releases of R are made regularly (several times a year), as R is actively being
improved all the time. It is worthwhile installing new versions of R regularly, to make
sure that you have a recent version of R (to ensure compatibility with all the latest
versions of the R packages that you have downloaded).
Installation steps for R on Windows Operating System
To install R consider the steps given below:
1. Open your browser and type the url: https://cran.r-project.org
2. Under “Download and Install R”, click on the “Download R for Windows” link.
3. Under "Subdirectories", click on the "base" link.
4. On the next page, you should see a link saying something like
"Download R 2.10.1 for Windows" (or R X.X.X, where X.X.X gives the version
of R, e.g. R 2.11.1). Click on this link.
5. You may be asked if you want to save or run a file "R-2.10.1-win32.exe". Choose
"Save" and save the file on the Desktop. Then double-click on the icon for the file
to run it.
6. You will be asked what language to install it in; choose English.
7. The R Setup Wizard will appear in a window. Click "Next" at the bottom of the R
Setup wizard window.
8. The next page says "Information" at the top. Click "Next" again.
9. The next page says "Information" at the top. Click "Next" again.
10. The next page says "Select Destination Location" at the top. By default, it will
suggest installing R in "C:\Program Files" on your computer.
11. Click "Next" at the bottom of the R Setup wizard window.
12. The next page says "Select components" at the top. Click "Next" again.
13. The next page says "Startup options" at the top. Click "Next" again.
14. The next page says "Select start menu folder" at the top. Click "Next" again.
15. The next page says "Select additional tasks" at the top. Click "Next" again.
16. R should now be installed. This will take about a minute. When R has finished,
you will see "Completing the R for Windows Setup Wizard" appear. Click "Finish".
17. To start R, you can follow either step 18 or step 19:
18. Check if there is an "R" icon on the desktop of the computer that you are using.
If so, double-click on the "R" icon to start R. If you cannot find an "R" icon, try
step 19.
19. Click on the "Start" button at the bottom left of your computer screen, then
choose "All programs", and start R by selecting "R" (or R X.X.X, where X.X.X gives
the version of R, e.g. R 2.10.0) from the menu of programs.
20. The R console (a rectangle) should pop up.
Getting RStudio
RStudio is an IDE (Integrated Development Environment) that makes R easier to
use and is more similar to SPSS or Stata. It includes a code editor, debugging
tools and visualization tools. Using it makes for a more pleasant R experience.
Head over to https://www.rstudio.com/products/rstudio/download/ and choose the
‘FREE’ version to download on your PC.
RStudio is not entirely free: some advanced features require a paid subscription.
After following the on-screen steps to install RStudio on a Windows PC, start
RStudio either from the desktop icon or from the programs menu.
Once RStudio has started successfully, you will see its main window.
The Tidyverse
The Tidyverse gives users access to a set of packages that extend R's capabilities and
share an underlying design philosophy.
A good example is dplyr, a package that greatly simplifies data manipulation. For
instance, it provides, among other functions, group_by and summarise, which perform
operations such as SUMIF or SUMIFS from Microsoft Excel.
If you want to create plots in R, the Tidyverse provides the ggplot2 package for plot
creation. There are good tutorials available for learning ggplot2.
Another useful feature is that the Tidyverse provides the haven package to import and
export data in SPSS, Stata, and SAS formats.
The installation instructions differ depending on your operating system:
Microsoft Windows, Mac OS X or Ubuntu Linux.
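As a rough illustration of the SUMIF-style grouping mentioned above, the sketch below totals a column per group with dplyr and, for comparison, with base R. The sales data frame and its column names are made up for this example, and the dplyr part only runs if the package is installed.

```r
# Hypothetical data: region and amount are invented for illustration.
sales <- data.frame(region = c("east", "west", "east", "west"),
                    amount = c(10, 20, 30, 40))

# dplyr version (skipped when dplyr is not installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  totals <- sales %>%
    group_by(region) %>%
    summarise(total = sum(amount))
  print(totals)
}

# Base R equivalent, shown for comparison
base_totals <- aggregate(amount ~ region, data = sales, FUN = sum)
base_totals   # east 40, west 60
```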
1.10 Using R Command Line Interface
The R console and other interactive tools like RStudio are great for prototyping
code and exploring data, but sooner or later we will want to use our program in a
pipeline or run it in a shell script to process thousands of data files. In order to do
that, we need to make our programs work like other UNIX command-line tools. For
example, we may want a program that reads a data set and prints the average
inflammation per patient:
$ Rscript readings.R --mean data/inflammation-01.csv
15.45
15.425
16.1
...
16.4
17.05
5.9
However, we might also want to look at the minimum of the first four lines:
$ head -4 data/inflammation-01.csv | Rscript readings.R --min
Or the maximum inflammations in several files one after another:
$ Rscript readings.R --max data/inflammation-*.csv
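The text does not list the contents of readings.R, so the following is only a hypothetical sketch of how such a script could be organized. It assumes each row of the CSV file is one patient, that the files have no header row, and that the first command-line argument is one of --min, --mean or --max. The function names summarise_rows and main are introduced here for illustration.

```r
# readings.R: a hypothetical sketch, not the original script.
summarise_rows <- function(dat, action) {
  stat <- switch(action,
                 "--min"  = min,
                 "--mean" = mean,
                 "--max"  = max,
                 stop("unknown action: ", action))
  apply(dat, 1, stat)   # one value per row (assumed: one row per patient)
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  action <- args[1]
  filenames <- args[-1]
  if (length(filenames) == 0) {
    # no file names given: read the data from standard input
    dat <- read.csv(file("stdin"), header = FALSE)
    cat(summarise_rows(dat, action), sep = "\n")
  }
  for (f in filenames) {
    dat <- read.csv(f, header = FALSE)
    cat(summarise_rows(dat, action), sep = "\n")
  }
}

# Run only when arguments were supplied on the command line
if (!interactive() && length(commandArgs(trailingOnly = TRUE)) > 0) main()
```

commandArgs(trailingOnly = TRUE) returns only the arguments that follow the script name, which is why the action flag is the first element and the file names are the rest.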
Running R scripts from the command line can be a powerful way to:
 Automate your R scripts
 Integrate R into production
 Call R through other tools or systems
There are basically two Linux commands that are used. The first is Rscript, which is
preferred. The older command is R CMD BATCH. You can call these directly from the
command line or integrate them into a bash script. You can also call them from any job
scheduler.
Note that these are R-related tools. The RStudio IDE does not currently come with
tools that enhance or manage the Rscript and R CMD BATCH functions. However,
there is a shell built into the IDE, and you can call these commands from there. The
alternative to using the Linux command line is to use the source() function inside R.
The source() function also runs a script, but you have to be inside an R session to
use it.
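A minimal sketch of source() in use. The script name hello.R and the variable answer are hypothetical, and the file is written to a temporary folder so the example runs anywhere.

```r
# Sketch: write a one-line script to a temporary file, then run it with source().
script <- file.path(tempdir(), "hello.R")   # hypothetical script name
writeLines("answer <- 6 * 7", script)

source(script)   # runs the script inside the current R session
answer           # 42: objects created by the script are now available
```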
Command-Line Arguments in R
Using the text editor of your choice, save the following line of code in a text file
called session-info.R:
sessionInfo()
The function sessionInfo outputs the version of R you are running as well as the type of
computer you are using (as well as the versions of the packages that have been loaded).
This is very useful information to include when asking others for help with your
R code. Now we can run the code in the file we created from the Unix shell
using Rscript:
$ Rscript session-info.R
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
Matrix products: default
BLAS: /home/travis/R-bin/lib/R/lib/libRblas.so
LAPACK: /home/travis/R-bin/lib/R/lib/libRlapack.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
loaded via a namespace (and not attached):
[1] compiler_3.5.1
1.11 Exploring and Learning RStudio
What is RStudio?
As noted earlier, RStudio is an Integrated Development Environment (IDE) that
lets you interact with R more readily. RStudio is similar to the standard
RGui, but is considerably more user friendly. It has more drop-down menus, windows
with multiple tabs, and many customization options. The first time you open
RStudio, you will see three windows. A fourth window is hidden by default, but can
be opened by clicking the File drop-down menu, then New File, and then R Script.
Detailed information on using RStudio can be found at RStudio's website.
RStudio Windows/Tabs   Position      Description
Console window         Lower-left    Location where commands are entered and the output is printed
Source Tabs            Upper-left    Built-in text editor
Environment Tab        Upper-right   Interactive list of loaded R objects
History Tab            Upper-right   List of key strokes entered into console
Files Tab              Lower-right   File explorer to navigate C drive folders
Plots Tab              Lower-right   Output location for plots
Packages Tab           Lower-right   List of installed packages
Help Tab               Lower-right   Output location for help commands and help search window
Viewer Tab             Lower-right   Advanced tab for local web content
 Fundamental Hints for using R
1. R is command-line driven. It requires you to type or copy-and-paste
commands after a command prompt (>) that appears when you open R. After
typing a command in the R console and pressing Enter on your keyboard, the
command will run. If your command is not complete, R issues a continuation
prompt (shown by a plus sign: +). Alternatively, you can write a script in the
script window, select a command, and click the Run button.
2. Like most languages, R is case sensitive, meaning that variables named aVar
and avar are two different variables.
3. Commands in R are also called functions. The basic format of a function in R
is: function.name(argument, options).
4. The up arrow (↑) on your keyboard can be used to bring up previous commands
that you’ve typed in the R console.
5. The $ symbol is used to select a particular column within the table
(e.g., table$column).
6. Any text that you do not want R to act on (such as comments, notes, or
instructions) needs to be preceded by the # symbol (a.k.a. hash-tag, comment,
pound, or number symbol). R ignores the remainder of the script line
following #. For instance:
plot(x, y) # This text will not affect the plot function because of the comment
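The hints above can be tried out with the short sketch below. The data frame soil and its values are hypothetical and exist only to give the $ operator something to select from.

```r
# Case sensitivity: aVar and avar are two separate variables
aVar <- 10
avar <- 20
aVar == avar                  # FALSE

# Hypothetical table used only for illustration
soil <- data.frame(depth = c(14, 25, 10), sand = c(19, 21, 23))
soil$sand                     # selects the sand column: 19 21 23
mean(soil$depth)              # function.name(argument)

round(3.14159, digits = 2)    # an argument plus a named option: 3.14
# Everything after the # symbol on a line is ignored by R
```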
1.11.1 Data Management in RStudio
Before you start working in R, you should set your working directory (a folder to hold
all of your project files); for example, "C:\workspace\… ". This directory is where all
your input data sets are stored. It also serves as the default location for plots and
other objects exported from R. Once set, it conveniently allows you to import data into
R with just a file name, not the whole file path.
Typically, you should set your working directory at the start of every R session. To
change the working directory in RStudio, select the
Files Tab > More > Set as Working Directory.
This can also be done in the Console by typing:
> setwd("C:/workspace")
# beware R uses forward slashes / instead of back slashes \ for file path names
To check the file path of the current working directory (which should now be
“C:/workspace”), type:
getwd()
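The following sketch shows the same setwd() and getwd() calls in a form that runs on any machine. It uses a temporary folder as a stand-in for a project folder such as C:/workspace, and the file name some_file.csv is hypothetical.

```r
# Sketch: change the working directory, check it, then restore the old one.
old_dir <- getwd()               # remember the current working directory
work_dir <- tempdir()            # stand-in for your own project folder
setwd(work_dir)
getwd()                          # now reports the temporary folder
file.exists("some_file.csv")     # relative paths are resolved against it
setwd(old_dir)                   # restore the original working directory
```

Restoring the previous directory is a habit worth keeping in scripts, so that later relative paths are not silently resolved against an unexpected folder.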
1.11.2 Importing Data in RStudio
After your working directory is set, you can import data from .csv, .txt, and other file
types. One basic command for importing data into R is read.csv(). The command is
followed by the file name and then some optional instructions for how to read
the file.
First, create an example file by copying the contents below. Paste the text
into Notepad and save the file as sand_example.csv in your C:\workspace folder.
location,landuse,horizon,depth,sand
city,crop,A,14,19
city,crop,B,25,21
city,pasture,A,10,23
city,pasture,B,27,34
city,range,A,15,22
city,range,B,23,23
farm,crop,A,12,31
farm,crop,B,31,35
farm,pasture,A,17,30
farm,pasture,B,26,36
farm,range,A,15,25
farm,range,B,24,29
west,crop,A,13,27
west,crop,B,29,25
west,pasture,A,11,21
west,pasture,B,31,26
west,range,A,14,23
west,range,B,24,24
This dataset can be imported into R either by using the Import Dataset button in the
Environment tab, or by typing the following command into the R console:
sand <- read.csv("C:/workspace/sand_example.csv")
# if your working directory was already set you could simply use the filename, like so
# sand <-read.csv("sand_example.csv")
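For readers who want to try read.csv() without creating a file first, the sketch below reads a few rows of the same data from a character string through the text argument. The variable csv_text is introduced here for illustration.

```r
# Sketch: read CSV content from a string instead of a file.
csv_text <- "location,landuse,horizon,depth,sand
city,crop,A,14,19
city,crop,B,25,21
farm,pasture,A,17,30"
sand <- read.csv(text = csv_text)

dim(sand)                              # 3 rows, 5 columns
class(sand$depth)                      # "integer": numeric columns are detected automatically
sand[sand$landuse == "crop", "sand"]   # sand values for the crop rows: 19 21
```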
1.11.3 Exporting, Viewing and Removing Data
To export data from R, use the write.csv() function. Since we have
already set our working directory, R automatically saves the file into the working
directory.
write.csv(sand, file = "sand_example2.csv")
# or use the write.table() function to export other text file types
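A short round-trip sketch of both export functions. The small sand data frame is a stand-in for the imported data, and the files are written to a temporary folder rather than the working directory so the example leaves nothing behind.

```r
# Sketch: export with write.csv() and write.table(), then read the CSV back.
sand <- data.frame(location = c("city", "farm"), sand = c(19, 31))
out_file <- file.path(tempdir(), "sand_example2.csv")

write.csv(sand, file = out_file, row.names = FALSE)   # omit the row-number column
back <- read.csv(out_file)
names(back)                                           # "location" "sand"

# Tab-separated text file with write.table()
write.table(sand, file = file.path(tempdir(), "sand_example2.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)
```

Without row.names = FALSE, write.csv() adds a leading column of row numbers, which then reappears as an extra column when the file is read back.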
Once a file has been imported, it is important to check that R imported your data
correctly. Make sure numerical data are correctly imported as numerical, that your
column headings are preserved, and so on. To view the data, simply click on the sand
dataset listed in the Environment tab. This will open a separate window that shows a
spreadsheet-like view.
In addition, you can use the following functions to view your data in R.
Function   Description
print()    prints the entire object (avoid with large tables)
head()     prints the first 6 lines of your data
str()      shows the data structure of an R object
names()    lists the column names (i.e. headers) of the data
ls()       lists all the R objects in your workspace
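The viewing functions listed above, together with rm() for removing an object, can be tried as in the sketch below. The sand data frame built here is a hypothetical stand-in for the imported data set.

```r
# Sketch: view and then remove a data frame.
sand <- data.frame(location = rep(c("city", "farm", "west"), each = 6),
                   depth = 1:18)   # hypothetical stand-in data

head(sand)       # first 6 rows
head(sand, 3)    # first 3 rows
names(sand)      # "location" "depth"
str(sand)        # structure: 18 observations of 2 variables
ls()             # names of the objects in the current workspace

rm(sand)         # remove the object when it is no longer needed
exists("sand")   # FALSE
```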
1.12 Using Help Feature in R
R has extensive documentation, numerous mailing lists, and countless books (many of
which are free and listed at the end of each chapter of this course). The built-in help
files are sometimes cryptic, and the online answers can be brusque, but if you look for
help you will find it. To learn more about the function you are using and the options
and arguments available, learn to help yourself by taking advantage of some of the
following help functions in RStudio:
 Use the Help tab in the lower-right window to search for commands
(such as hist) or topics (such as histogram).
 Type help(read.csv) or ?read.csv in the Console window to bring up a help
page. Results will appear in the Help tab in the lower-right window. Certain
functions may require quotation marks, for example help("+").
# Help file for a function
help(read.csv) # or ?read.csv
# Help files for a package
help(package = "soiltexture")
1.13 Chapter Summary
As the number of career paths in big data continues to grow, so does the shortage of
big data professionals needed to fill those positions. The earlier sections of this
chapter introduced and explained the qualities needed to succeed in the field of big
data. Qualities such as communication, knowledge of big data concepts, and agility are
just as important as the technical skills.
As for R, it is a free and open source tool, which makes world-class statistical
analysis tools available to anyone. Learning R is not easy; if it were, data scientists
would not be in such high demand. However, there is no shortage of quality resources
you can use to learn R if you are willing to put in the time and effort.
1.14 Online Resources
¹http://cran.r-project.org/
The Comprehensive R Archive Network
²http://cran.r-project.org/doc/manuals/r-release/R-intro.html
The R manual: An Introduction to R
³http://cran.r-project.org/doc/manuals/r-release/R-data.html
This is a guide to importing and exporting data to and from R.
⁴http://cran.r-project.org/doc/manuals/r-release/R-exts.html
This is a guide to extending R, describing the process of creating R add-on packages,
writing R documentation, R’s system and foreign language interfaces, and the R API.
⁵http://cran.r-project.org/doc/manuals/r-release/R-admin.html
This is a guide to installation and administration for R.
⁶http://cran.r-project.org/doc/manuals/r-release/R-ints.html
This is a guide to the internal structures of R and coding standards for the core team
working on R itself.
⁷http://cran.r-project.org/doc/manuals/r-release/R-lang.html
This is an introduction to the R language, explaining evaluation, parsing, object oriented
programming, computing on the language, and so forth.
1.15 Exercises
 Explain Data Analytics and the sources of Big Data.
 How do you import data in R? Explain the steps for data cleansing.
 What data structures are used in R?
 How do you quit an R session?
 What is the R Script Editor? List a few IDEs for using R.
 Compare the R and Python programming languages for predictive modeling.
 The iris dataset has different species of flowers such as Setosa, Versicolor
and Virginica with their sepal length. Now, we want to understand the
distribution of sepal length across all the species of flowers. One way to do
this is to visualize this relation through the graph shown below.
A) Which function can be used to produce the graph shown below?
i) xyplot() ii) stripplot() iii) barchart() iv) bwplot()
 Below is the given sample function. Consider it and answer the following
question.
f <- function(x) {
  g <- function(y) {
    y + z
  }
  z <- 4
  x + g(x)
}
A) If we execute following commands (written below), what will be the
output?
i) 12 ii) 7 iii) 4 iv) 16
CHAPTER 02
Introduction to R Programming
LEARNING OBJECTIVES
In this chapter you will learn:
 About Variables and Data types used in R
 How Vectors and Matrices are defined in R
 About Control Statements and loops
 How to use cbind and rbind function
2.1 Introduction to R Programming
R is an open source language maintained by the R Core Development Team, a group of
volunteer developers from across the globe. R is widely used for performing
statistical operations. It is available from the R Project website, which is hosted at
https://www.r-project.org/. R is a dialect of the S language that was developed at
AT&T Bell Laboratories by Rick Becker, John Chambers and Allan Wilks. Versions of R
are available, at no cost, for 32-bit versions of Microsoft Windows, for Linux, for
UNIX and for Macintosh OS X. (There are older versions of R that support Mac OS 8.6
and 9.) It is available through the Comprehensive R Archive Network (CRAN).
The citation for John Chambers' 1998 Association for Computing Machinery Software
award stated that S has "forever altered how people analyze, visualize and
manipulate data." The R project enlarges on the ideas and insights that generated the
S language.
Here are points that potential users might note:
R has extensive and powerful graphics abilities that are tightly linked with its analytic
abilities. The R system is developing rapidly. New features and abilities appear every
few months. Simple calculations and analyses can be handled straightforwardly. If
simple methods prove inadequate, there can be recourse to the huge range of more
advanced abilities that R offers. Adaptation of available abilities allows even greater
flexibility.
Since R is free, users have no right to expect attention, on the R-help list or
elsewhere, to queries. Be grateful for whatever help is given. Users who want a
point-and-click interface should investigate the R Commander (Rcmdr package)
interface.
While R is as reliable as any statistical software that is available, and exposed to
higher standards of scrutiny than most other systems, there are traps that call for
special care. Some of the model fitting routines are leading edge, with a limited
tradition of experience of the limitations and pitfalls. Whatever the statistical system,
and especially when there is some element of complication, check each step with care.
The skills needed for the computing are not on their own enough. Neither R nor any
other statistical system will give the statistical expertise needed to use sophisticated
abilities, or to know when naive methods are inadequate. Anyone with a contrary view
may care to consider whether a butcher's meat-cutting skills are likely to be adequate
for effective animal (or maybe human!) surgery. Experience with the use of R is,
however, more than with most systems, likely to be an educational experience.
Hurrah for the R development team!
Why learn R Programming
R is well suited to statistics, data analysis and machine learning. With it we can create
objects, functions and packages. It is platform independent, so it runs on all major
operating systems. It is free, so anyone can install it in any organization without
purchasing a license.
2.2 Variables in R
Variables are placeholders for values in a programming language. In this section we
discuss variables and constants in R and the best practices for using a variable in a
program.
2.2.1 What is Variable?
Variables are used to store data whose value can be changed as needed. The unique name
given to a variable (or to a function or other object) is its identifier.
• Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed
by a digit.
3. Reserved words in R cannot be used as identifiers.
• Valid Identifiers in R
total, Sum, .fine.with.dot, this_is_variable, Number4
• Invalid Identifiers in R
tot@l, 5um, _not-a-variable, TRUE, .0ne
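The identifier rules can also be checked mechanically. The sketch below is one possible
approach rather than an official test: it relies on base R's make.names(), which rewrites a
string into a syntactically valid name, and treats a string as a valid identifier when
make.names() leaves it unchanged. The helper name is_valid_identifier is hypothetical.

```r
# Hypothetical helper: a name counts as valid if make.names() does not alter it.
is_valid_identifier <- function(name) {
  make.names(name) == name
}

valid   <- c("total", "Sum", ".fine.with.dot", "this_is_variable", "Number4")
invalid <- c("tot@l", "5um", "_not-a-variable", "TRUE", ".0ne")

print(is_valid_identifier(valid))    # one logical value per candidate name
print(is_valid_identifier(invalid))
```

The first call should report every name as acceptable and the second should reject every
name, matching the two lists given in the text.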
2.2.2 Assigning values to Variables
We assign values to variables with the assignment operator "=". Typing the variable
name by itself at the prompt prints its value. Note that another form of the assignment
operator, "<-", is also in use.
> x = 1
> x
[1] 1
2.2.3 Good Practices
Earlier versions of R used underscore (_) as an assignment operator, so the period (.) was
used extensively in variable names having multiple words.
Current versions of R accept underscore in identifiers, but it is good practice to use the
period as the word separator.
For example, a.variable.name is preferred over a_variable_name; alternatively we could
use camel case, as in aVariableName.
2.2.4 Creating new Variables
Using the assignment operator <-, one can create a variable and assign a value to it in a
single step.
R has no separate declaration statement: a variable exists once it has been assigned, so a
placeholder such as NULL or NA can be assigned first and replaced by a real value later.
# example of creating new variables
# (adata is assumed to be a data frame with numeric columns x1 and x2)
# method 1: assign to new columns with $
adata$sum <- adata$x1 + adata$x2
adata$mean <- (adata$x1 + adata$x2) / 2
# method 2: use transform()
adata <- transform(adata,
                   sum = x1 + x2,
                   mean = (x1 + x2) / 2
)
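To try this end to end, here is a minimal self-contained sketch. The small data frame is
made up for illustration; only the column names x1 and x2 are taken from the snippet
above.

```r
# Toy data frame; the values are invented for this sketch.
adata <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))

# Add two derived columns in a single call.
adata <- transform(adata,
                   sum  = x1 + x2,
                   mean = (x1 + x2) / 2)

print(adata)
```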
Can we Rename Variables?
Yes, variables can be renamed either programmatically or interactively.
# rename interactively
fix(mydata) # results are saved on close
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
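If the reshape package is not installed, the same renaming can be done with base R alone.
The following is a sketch under the assumption that mydata is a data frame with a column
called oldname; both names are placeholders.

```r
# Toy data frame with a placeholder column name.
mydata <- data.frame(oldname = 1:3, other = c("a", "b", "c"))

# Replace the matching column name in place, using base R only.
names(mydata)[names(mydata) == "oldname"] <- "newname"

print(names(mydata))
```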
2.3 Scalars in R
R supports a wide variety of data types and objects, including scalars, vectors, matrices,
data frames and lists. Data is the most basic ingredient of data analysis. In this section, we
go over some commonly used data types and briefly cover objects at the end.
In programming, a scalar refers to an atomic quantity that can hold only one value at a
time. Scalars are the most basic data types and can be used to construct more complex
ones. In R a scalar is simply a vector of length one. Let us look at some common types of
scalars with simple R commands.
Number
> x <- 2
> y <- 1.5
> class(x)
[1] "numeric"
> class(y)
[1] "numeric"
> class(y+x)
[1] "numeric"
Logical Value
> m <- x > y    # Is x larger than y?
> n <- x < y    # Is x smaller than y?
> m
[1] TRUE
> n
[1] FALSE
> class(m)
[1] "logical"
> class(NA)     # NA is another logical value: 'Not Available'/Missing Values
[1] "logical"
There are a few more logical operators you may want to try.
> m & n         # AND
[1] FALSE
> m | n         # OR
[1] TRUE
> !m            # Negation
[1] FALSE
Character (string)
> a <- "1"; b <- "2.5"    # Are they different from x and y we used earlier?
> a; b
[1] "1"
[1] "2.5"
> a + b                   # a+b=3.5?
Error in a + b : non-numeric argument to binary operator
> class(a)
[1] "character"
> class(as.numeric(a))    # but you can coerce this character into a number
[1] "numeric"
> class(as.character(x))  # vice versa
[1] "character"
2.4 Data Types in R
A data structure is a particular way of organizing and storing data. R supports five basic
data structures, namely vector, matrix, list, data frame and factor. We will discuss these
data structures and how to write them in R.
R has a variety of data structures to dive into. Let us see some of them:
2.4.1 Vectors
Vectors are the basic R type. A vector is a sequence of data elements of the same basic
type. Members in a vector are officially called components. Nevertheless, we will just call
them members in this text.
Here is a vector containing three numeric values 2, 3 and 5.
> c(2, 3, 5)
[1] 2 3 5
Here is a vector of logical values.
> c(TRUE, FALSE, TRUE, FALSE, FALSE)
[1]  TRUE FALSE  TRUE FALSE FALSE
A vector can also contain character strings.
> c("aa", "bb", "cc", "dd", "ee")
[1] "aa" "bb" "cc" "dd" "ee"
Incidentally, the number of members in a vector is given by the length function.
> length(c("aa", "bb", "cc", "dd", "ee"))
[1] 5
• Combining Vectors
Vectors can be combined with the function c. For instance, the two vectors n and p below
are combined into a new vector containing elements from both vectors. Because a vector
holds a single type, the numeric values are coerced to character strings in the result.
> n = c(2, 3, 5)
> p = c("aa", "bb", "cc", "dd", "ee")
> c(n, p)
[1] "2" "3" "5" "aa" "bb" "cc" "dd" "ee"
Arithmetic operations on vectors are performed member by member, i.e., memberwise.
For instance, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
Then, if we multiply a by 5, we get a vector with each of its members multiplied by 5.
> 5 * a
[1]  5 15 25 35
And if we add a and b together, the sum is a vector whose members are the sums of the
corresponding members of a and b.
> a + b
[1] 2 5 9 15
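The vector operations of this section can be collected into one short script. This is only a
sketch that reuses the values from the text; the variable names scaled, total and mixed are
chosen here for illustration.

```r
a <- c(1, 3, 5, 7)
b <- c(1, 2, 4, 8)

scaled <- 5 * a       # each member of a multiplied by 5
total  <- a + b       # memberwise sum
mixed  <- c(a, "x")   # adding a string coerces every member to character

print(scaled)
print(total)
print(class(mixed))
print(length(mixed))
```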
2.4.2 Matrices
A matrix is a two-dimensional (2D) data structure that can be created using the matrix()
function in R. The numbers of rows and columns are set with the nrow and ncol
arguments. Providing both is not required, because the other dimension is inferred from
the length of the data. All columns in a matrix must have the same mode (numeric,
character, etc.) and the same length. The general format is
mymatrix <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
byrow=TRUE indicates that the matrix should be filled by rows. byrow=FALSE indicates
that the matrix should be filled by columns (the default). dimnames provides optional
labels for the columns and rows.
# generates 5 x 4 numeric matrix
y<-matrix(1:20, nrow=5,ncol=4)
# another example
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
dimnames=list(rnames, cnames))

Constructing a Matrix in R
There are various ways to construct a matrix. When we construct a matrix directly from
data elements, the matrix content is filled along the column orientation by default. For
instance, in the following code snippet, the content of B is filled along the columns
consecutively.
> B = matrix(
+   c(2, 4, 3, 1, 5, 7),
+   nrow=3,
+   ncol=2)
> B             # B has 3 rows and 2 columns
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
Transpose
We construct the transpose of a matrix by interchanging its columns and rows with the
function t.
> t(B)          # transpose of B
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7

Merging of Matrices
The columns of two matrices having the same number of rows can be combined into a
larger matrix. For instance, suppose we have another matrix D, also with 3 rows.
> D = matrix(
+   c(7, 4, 2),
+   nrow=3,
+   ncol=1)
> D             # D has 3 rows
     [,1]
[1,]    7
[2,]    4
[3,]    2
Then we can combine the columns of B and D with cbind.
> cbind(B, D)
     [,1] [,2] [,3]
[1,]    2    1    7
[2,]    4    5    4
[3,]    3    7    2
Similarly, we can combine the rows of two matrices, if they have the same number of
columns, with the rbind function.
> E = matrix(
+   c(6, 2),
+   nrow=1,
+   ncol=2)
> E             # E has 2 columns
     [,1] [,2]
[1,]    6    2
> rbind(B, E)
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
[4,]    6    2

Deconstruction
A matrix can be deconstructed by applying the c function, which combines all its column
vectors into one.
> c(B)
[1] 2 4 3 1 5 7
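As a compact recap, the matrix operations of this section can be run as one script. This is
a sketch: the matrix B matches the text, while the names D, E, wide and tall are chosen
here for illustration.

```r
B <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 3, ncol = 2)  # filled by column
D <- matrix(c(7, 4, 2), nrow = 3, ncol = 1)           # same number of rows as B
E <- matrix(c(6, 2), nrow = 1, ncol = 2)              # same number of columns as B

wide <- cbind(B, D)   # columns of D appended to B
tall <- rbind(B, E)   # row of E appended to B

print(dim(t(B)))      # dimensions of the transpose
print(dim(wide))
print(dim(tall))
print(c(B))           # B flattened column by column
```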
2.4.3 What is List in R
A list is a structure that can hold data of different types. It is similar to a vector, but
whereas a vector contains data of a single type, a list can contain mixed data.
A list is created using list(). It is an ordered collection of objects (components) that
allows you to gather a variety of (possibly unrelated) objects under one name.
# example of a list with 4 components:
# a string, a numeric vector, a matrix, and a scalar
# (a and y are assumed to be the vector and matrix created earlier)
w <- list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
# example of a list containing two lists
# (list1 and list2 are assumed to be existing lists)
v <- c(list1,list2)
A list is a generic vector containing other objects.
For instance, the following variable x is a list containing copies of three vectors n, s, b,
and a numeric value 3.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3) # x contains copies of n, s, b

Slicing List in R
We retrieve a list slice with the single square bracket "[]" operator. The following is a
slice containing the second member of x, which is a copy of s.
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
With an index vector, we can retrieve a slice with multiple members. Here is a slice
containing the second and fourth members of x.
> x[c(2, 4)]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
[[2]]
[1] 3

Reference of Members
In order to reference a list member directly, we have to use the double square bracket
"[[]]" operator. The following object x[[2]] is the second member of x. In other words,
x[[2]] is a copy of s, but is not a slice containing s or its copy.
> x[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
We can modify its content directly.
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
> s
[1] "aa" "bb" "cc" "dd" "ee" # s is unaffected
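The difference between a slice and a member is easy to confirm by inspecting classes. The
following sketch rebuilds a list like x from the text; the names slice and member are
illustrative.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
x <- list(n, s, TRUE, 3)

slice  <- x[2]     # a list of length 1 that wraps the character vector
member <- x[[2]]   # the character vector itself

print(class(slice))
print(class(member))

x[[2]][1] <- "ta"  # changes the copy stored inside x only
print(x[[2]])
print(s)           # the original vector s keeps its first member
```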
2.4.4 Data Frames
A data frame is a special case of a list in which every component has the same length. A
data frame is created using the data.frame() function.
For instance, consider the following:
> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
> str(x) # structure of x
'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1
The Name column appears as a factor because versions of R before 4.0.0 converted
strings to factors by default; from R 4.0.0 onward it is kept as a character vector unless
stringsAsFactors = TRUE is passed.
A data frame is used for storing data tables. It is a list of vectors of equal length. For
instance, the following variable df is a data frame containing three vectors n, s, b.
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)       # df is a data frame
Pre-defined Data Frame
We use built-in data frames in R for our tutorials. For instance, here is a built-in data
frame in R, called mtcars.
> mtcars
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 ...
Datsun 710    22.8   4  108  93 3.85 2.32 ...
The top line of the table, called the header, contains the column names. Each horizontal
line afterward denotes a data row, which begins with the name of the row, followed by
the actual data. Each data member of a row is called a cell.
To retrieve data in a cell, we enter its row and column coordinates in the single square
bracket "[]" operator. The two coordinates are separated by a comma. In other words, the
coordinates begin with the row position, followed by a comma, and end with the column
position. The order is important.
Here is the cell value from the first row, second column of mtcars.
> mtcars[1, 2]
[1] 6
Moreover, we can use the row and column names instead of the numeric coordinates.
> mtcars["Mazda RX4", "cyl"]
[1] 6

Data Frame Column Vector
We reference a data frame column with the double square bracket "[[]]" operator.
For instance, to retrieve the ninth column vector of the built-in data set mtcars, we
write mtcars[[9]].
> mtcars[[9]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
We may also retrieve the exact same column vector by its name.
> mtcars[["am"]]
[1] 1 1 1 0 0 0 0 0 0 0 0 ...
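The access patterns above can be verified against each other in a short script. This sketch
uses only the built-in mtcars data set; the variable names are chosen here for illustration.

```r
# mtcars ships with R (datasets package), so nothing has to be loaded.
cell_by_position <- mtcars[1, 2]
cell_by_name     <- mtcars["Mazda RX4", "cyl"]

col_by_position <- mtcars[[9]]
col_by_name     <- mtcars[["am"]]

print(cell_by_position)
print(cell_by_name)
print(identical(col_by_position, col_by_name))  # both forms give the same vector
print(head(mtcars[, c("mpg", "cyl")], 3))       # first three rows, two columns
```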
2.5 Arrays
An array in R is a multi-dimensional data structure. Its data is stored in layers of matrices,
each with rows and columns, and an element is accessed by its row index, column index
and matrix level.
• R Array
Arrays are data objects that can store data in more than two dimensions. An array is
created using the array() function. It takes vectors as input and uses the values in the dim
parameter to create the array.
For Example:
In the example below, we create an R array of two matrices, each with 3 rows and 3
columns.
# create two vectors of different lengths.
vector1 <- c(2,9,3)
vector2 <- c(10,16,17,13,11,15)
# take these vectors as input to the array.
# the 9 values are recycled to fill all 3 x 3 x 2 = 18 cells.
result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
When the above code is run or executed:
, , 1

     [,1] [,2] [,3]
[1,]    2   10   13
[2,]    9   16   11
[3,]    3   17   15

, , 2

     [,1] [,2] [,3]
[1,]    2   10   13
[2,]    9   16   11
[3,]    3   17   15
Arrays are similar to matrices but can have more than two dimensions. Arrays with more
than five dimensions are possible, but they are usually complex to handle.
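Individual elements and whole matrices are picked out of an array with three indices: row,
column and matrix level. The sketch below rebuilds the array from the example and
indexes into it.

```r
vector1 <- c(2, 9, 3)
vector2 <- c(10, 16, 17, 13, 11, 15)
result  <- array(c(vector1, vector2), dim = c(3, 3, 2))

print(dim(result))       # rows, columns, number of matrices
print(result[2, 3, 1])   # row 2, column 3 of the first matrix
print(result[, , 2])     # the whole second matrix
```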
2.6 Classes and Objects
R has three class systems. We will now cover all three: S3, S4 and Reference Classes.
We can do object-oriented programming in R. In fact, everything in R is an object.
An object is a data structure having some attributes and methods which act on its
attributes.
A class is a blueprint for the object. We can think of a class as a sketch (prototype) of a
house. It contains all the details about the floors, doors, windows and so on. Based on
these descriptions we build the house.
The house is the object. As many houses can be made from one description, we can
create many objects from a class. An object is also called an instance of a class, and the
process of creating this object is called instantiation.
While most programming languages have a single class system, R has three class
systems: S3, S4 and, more recently, Reference Classes.
They have their own features and peculiarities, and choosing one over the other is a
matter of preference. Below, we give a brief introduction to them.
• S3 Class
The S3 class system is somewhat primitive in nature. It lacks a formal definition, and an
object of an S3 class can be created simply by adding a class attribute to it.
This simplicity accounts for the fact that it is widely used in R. In fact, most of the
built-in classes in R are of this type.
> # create a list with required components
> s <- list(name = "John", age = 21, GPA = 3.5)
> # name the class appropriately
> class(s) <- "student"
The code snippet above creates an object of S3 class student from the given list.
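What makes the class attribute useful is method dispatch: a generic function such as
print() looks for a method named after the class. The sketch below adds a print method
for the student class; the method body is invented for illustration.

```r
s <- list(name = "John", age = 21, GPA = 3.5)
class(s) <- "student"

# A method for the generic print(); its name follows the generic.class pattern.
print.student <- function(x, ...) {
  cat("Student:", x$name, "| age", x$age, "| GPA", x$GPA, "\n")
  invisible(x)
}

print(s)                        # dispatches to print.student
print(inherits(s, "student"))   # checks the class attribute
```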
• S4 Class in R
S4 classes are an improvement over S3 classes. They have a formally defined structure,
which helps make objects of the same class look more or less alike.
Class components are properly defined using the setClass() function and objects are
created using the new() function.
Unlike S3 classes and objects, which lack a formal definition, an S4 class is stricter in the
sense that it has a formal definition and a uniform way to create objects.
This adds safety to our code and prevents us from accidentally making naive mistakes.
Snippet of S4 Class
> setClass("student", slots=list(name="character", age="numeric", GPA="numeric"))
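A short sketch of how such a class is used follows. It defines the class, creates an object
with new(), reads a slot with @, and shows that a slot value of the wrong type is rejected.
The variable names s and bad are illustrative.

```r
library(methods)  # provides setClass() and new()

setClass("student",
         slots = list(name = "character", age = "numeric", GPA = "numeric"))

s <- new("student", name = "John", age = 21, GPA = 3.5)
print(isS4(s))
print(s@name)     # slots are accessed with @

# A slot value of the wrong type is refused when the object is created.
bad <- tryCatch(new("student", name = "John", age = "twenty-one", GPA = 3.5),
                error = function(e) "rejected")
print(bad)
```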
• Reference Class
Reference classes were introduced later than the other two. They are closer to the
object-oriented programming we are used to seeing in other major programming
languages such as C++, Java and Python.
Unlike S3 and S4 classes, methods belong to the class rather than to generic functions.
Reference classes are internally implemented as S4 classes with an environment added to
them.
> setRefClass("student")
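The practical consequence is that reference objects are modified in place. The following
sketch defines a generator with two fields and one method; the class name Student, the
field names and the method name birthday are all made up for illustration.

```r
library(methods)

# Generator object for a reference class with two fields and one method.
Student <- setRefClass("Student",
  fields  = list(name = "character", age = "numeric"),
  methods = list(
    birthday = function() {
      age <<- age + 1   # modifies the object itself
      invisible(.self)
    }
  )
)

a <- Student$new(name = "John", age = 21)
b <- a            # b refers to the same object, not to a copy
b$birthday()

print(a$age)      # the change made through b is visible through a
```

Calling a$copy() instead of plain assignment would give an independent object.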
• Comparison of S3, S4 and Reference Classes
Let us see some differences between S3, S4 and Reference Classes.
S3 Class                            | S4 Class                           | Reference Class
Lacks formal definition             | Class defined using setClass()     | Class defined using setRefClass()
Objects are created by setting      | Objects are created using new()    | Objects are created using
the class attribute                 |                                    | generator functions
Attributes are accessed using $     | Attributes are accessed using @    | Attributes are accessed using $
Methods belong to generic function  | Methods belong to generic function | Methods belong to the class
Follows copy-on-modify semantics    | Follows copy-on-modify semantics   | Does not follow copy-on-modify
                                    |                                    | semantics
2.7 R Programming Structures
In this section we will learn how to write the simplest program, one that prints
"Hello World", in R.
In this program we use the function print() to display the string.
By default the string is displayed with double quotes. To avoid that, pass the argument
quote=FALSE.
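A minimal sketch of the program described above, showing both forms of the call:

```r
print("Hello World")                 # displayed with double quotes
print("Hello World", quote = FALSE)  # displayed without quotes
```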
R is a block-structured language in the manner of the ALGOL-descendant family, such as
C, C++, Python, Perl, and so on. As you have seen, blocks are delineated by braces,
though braces are optional if the block consists of just a single statement.
Statements are separated by newline characters or, optionally, by semicolons.
2.7.1 R Control Statements
These allow you to control the flow of execution of a script, typically inside of a
function. Common ones include:
• if, else
• for
• while
• repeat
• break
• return
• next
Most of these are not used when working with R interactively, but rather inside functions.
If
if (condition) {
# do something
} else {
# do something else
}
For
A for loop works on an iteration variable and assigns it successive values until the end of
a sequence.
# count from 1 to 11
for (i in 1:11) {
  print(i)
}
x <- c("apples", "oranges", "bananas", "strawberries")
# loop over the elements themselves
for (i in x) {
  print(i)
}
# loop over index positions
for (i in 1:4) {
  print(x[i])
}
for (i in seq_along(x)) {
  print(x[i])
}
# braces are optional for a single statement
for (i in 1:4) print(x[i])
Nested For Loops
m <- matrix(1:10, 2)
for (i in seq_len(nrow(m))) {
  for (j in seq_len(ncol(m))) {
    print(m[i, j])
  }
}
While Loop
i <- 1
while (i < 5) {
  print(i)
  i <- i + 1
}
Make sure that there is a way to exit a while loop.
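Break and Repeat
expectation <- 0.5   # illustrative values, not from a real simulation
threshold <- 0.05
repeat {
  # simulation: generate a value and compare it with the expectation;
  # exit the loop once the value is within the threshold
  value <- runif(1)
  if (abs(value - expectation) <= threshold) {
    break
  }
}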
Next Statement
for (i in 1:20) {
  if (i %% 2 == 1) {
    next
  } else {
    print(i)
  }
}
This loop prints only the even numbers and skips the odd numbers. Later we will learn
other functions that help us avoid these slow control flows (mostly the while and for
loops) as much as possible.
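As one illustration of avoiding an explicit loop, the same even numbers can be obtained
with vectorised indexing, and sapply() can replace a loop that collects results. This is a
sketch of the general idea, not a claim about which functions the later chapters use.

```r
i <- 1:20

# Logical indexing keeps only the even numbers, with no explicit loop.
evens <- i[i %% 2 == 0]
print(evens)

# sapply() applies a function to every element and collects the results.
squares <- sapply(1:5, function(k) k^2)
print(squares)
```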
Return Statement
Sometimes we need a function to do some processing and hand the result back to its
caller. This is accomplished with the return() statement in R.
Syntax:
return(expression)
R return() Statement Example:
check <- function(x) {
  if (x > 0) {
    result <- "Positive"
  } else if (x < 0) {
    result <- "Negative"
  } else {
    result <- "Zero"
  }
  return(result)
}
Below are some sample runs:
> check(1)
[1] "Positive"
> check(-10)
[1] "Negative"
> check(0)
[1] "Zero"
2.8 R Arithmetic Operators
In R you can perform basic math operations as in most programming languages.
The R arithmetic operators include addition, subtraction, division, multiplication,
exponent, integer division and modulus. All of these are binary operators, which means
they work on two operands. The table below shows all the arithmetic operators in R.
• Arithmetic Operations:
Operator   Description
+          Addition
-          Subtraction
*          Multiplication
/          Division
^          Exponent
%%         Modulus
%/%        Integer Division
Below is a sample of arithmetic operations:
> x <- 5
> y <- 16
> x+y
[1] 21
> x-y
[1] -11
> x*y
[1] 80
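The remaining operators from the table can be tried in the same way. The sketch below
uses the same two values and divides the larger by the smaller, so that the integer
division and modulus results are easy to read.

```r
x <- 5
y <- 16

print(y / x)     # division
print(y %/% x)   # integer division: how many whole times x fits into y
print(y %% x)    # modulus: what is left over
print(x ^ 2)     # exponent
```

Integer division and modulus fit together: (y %/% x) * x + (y %% x) gives back y.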
• Logical Operators in R
All the logical operations in R are performed using the operators listed below.
Operator   Description
!          Logical NOT
||         Logical OR
|          Element-wise logical OR
&&         Logical AND
&          Element-wise logical AND
Operators '&' and '|' perform element-wise operations, producing a result with the length
of the longer operand.
Operators '&&' and '||' work on single values and return a logical vector of length one.
Older versions of R examined only the first element of longer operands; since R 4.3.0 an
operand of length greater than one is an error, so the example below passes the first
elements explicitly.
Zero is treated as FALSE and non-zero numbers are treated as TRUE. A sample code:
> x <- c(TRUE,FALSE,0,6)
> y <- c(FALSE,TRUE,FALSE,TRUE)
> !x
[1] FALSE  TRUE  TRUE FALSE
> x&y
[1] FALSE FALSE FALSE  TRUE
> x[1]&&y[1]
[1] FALSE
> x|y
[1]  TRUE  TRUE FALSE  TRUE
> x[1]||y[1]
[1] TRUE
2.8.1 Values
For the default method of cbind and rbind, the value is a matrix combining the ...
arguments column-wise or row-wise. (Exception: if there are no inputs or all the inputs
are NULL, the value is NULL.)
The type of a matrix result is determined from the highest type of any of the inputs in the
hierarchy raw < logical < integer < double < complex < character < list.
For cbind (rbind) the column (row) names are taken from the colnames (rownames) of
the arguments if these are matrix-like. Otherwise they are taken from the names of the
arguments or, where those are not supplied and deparse.level > 0, by deparsing the
expressions given, for deparse.level = 1 only if that gives a sensible name (a 'symbol',
see is.symbol).
For cbind row names are taken from the first argument with appropriate names:
rownames for a matrix, or names for a vector of length the number of rows of the
result.
For rbind column names are taken from the first argument with appropriate names:
colnames for a matrix, or names for a vector of length the number of columns of the
result.
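These rules about the result type and the names can be observed directly. The sketch
below uses small made-up inputs; the argument names id, label, x and y are illustrative.

```r
# Integer and character inputs: the result takes the higher type, character.
m <- cbind(id = 1:2, label = c("a", "b"))
print(typeof(m))
print(colnames(m))   # taken from the argument names

# rbind takes column names from the first argument that has suitable names.
r <- rbind(c(x = 1, y = 2), c(3, 4))
print(colnames(r))

# With no non-NULL inputs the value is NULL.
print(is.null(cbind(NULL)))
```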
2.9 Functions in R
One of the best ways to improve your reach as a data scientist is to write functions.
Functions allow you to automate common tasks in a more powerful and general way
than copy-and-pasting. Writing a function has three big advantages over using
copy-and-paste:
1. You can give a function a memorable, descriptive name that makes your code
easier to understand.
2. As requirements change, you only need to update the code in one place,
instead of many.
3. You reduce the chance of making incidental errors when you copy and paste
(i.e. updating a variable name in one place, but not in another).
Writing good functions is a lifetime journey. Even after using R for many years I still
learn new techniques and better ways of approaching old problems. The goal of this
chapter is not to teach you every esoteric detail of functions but to get you started with
some pragmatic advice that you can apply immediately.
As well as practical advice for writing functions, this chapter also gives you some
suggestions for how to style your code. Good code style is like correct punctuation.
You can manage without it, but it certainly makes things easier to read. As with styles of
punctuation, there are many possible variations. Here we present the style we use in our
code, but the most important thing is to be consistent.
• Function Definition in R
An R function is created by using the keyword function. The basic syntax of an R
function definition is as follows:
function_name <- function(arg_1, arg_2, ...) {
   Function body
}
• Components of Function in R
An R function has the following components:
1. Name of the Function: this is the actual name of the function. It is stored in the R
environment as an object with this name.
2. Arguments: an argument or parameter is a placeholder for a value passed to the
function when it is called. When a function is invoked, you pass a value to
the argument. Arguments are optional; that is, a function may contain no
arguments. Arguments can also have default values.
3. Function Body: the function body contains a collection of statements that defines
what the function does.
4. Return Value: the return value of a function is the last expression in the function
body to be evaluated.
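The four components can be seen together in a small user-defined function. This is only
a sketch: the function name power and its arguments are invented for illustration.

```r
# Name: power. Arguments: base and exponent (exponent has a default value).
power <- function(base, exponent = 2) {
  base ^ exponent   # body; the last evaluated expression is the return value
}

print(power(3))                       # uses the default exponent
print(power(2, 10))                   # overrides the default
print(power(exponent = 3, base = 2))  # arguments can be matched by name
```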
R has many built-in functions which can be called directly in a program without defining
them first. We can also create and use our own functions, referred to as user-defined
functions.
There are some General Functions in R
builtins()     # List all built-in functions
options()      # Set options to control how R computes & displays results
abs(x)         # The absolute value of "x"
append()       # Add elements to a vector
c(x)           # A generic function which combines its arguments
cat(x)         # Prints the arguments
cbind()        # Combine vectors by row/column (cf. "paste" in Unix)
diff(x)        # Returns suitably lagged and iterated differences
gl()           # Generate factors with the pattern of their levels
grep()         # Pattern matching
identical()    # Test if 2 objects are *exactly* equal
jitter()       # Add a small amount of noise to a numeric vector
julian()       # Return Julian date
length(x)      # Return no. of elements in vector x
ls()           # List objects in current environment
mat.or.vec()   # Create a matrix or vector
paste(x)       # Concatenate vectors after converting to character
range(x)       # Returns the minimum and maximum of x
rep(1,5)       # Repeat the number 1 five times
rev(x)         # List the elements of "x" in reverse order
Then there are some Statistical Functions in R
help(package=stats)      # List all stats functions
?Chisquare               # Help on chi-squared distribution functions
?Poisson                 # Help on Poisson distribution functions
help(package=survival)   # Survival analysis
cor.test()               # Perform correlation test
cumsum(); cumprod(); cummin(); cummax()   # Cumulative functions for vectors
density(x)               # Compute kernel density estimates
ks.test()                # Performs one or two sample Kolmogorov-Smirnov tests
loess(), lowess()        # Scatter plot smoothing
mad()                    # Calculate median absolute deviation
mean(x), weighted.mean(x), median(x), min(x), max(x), quantile(x)
rnorm(), runif()         # Generate random data with Gaussian/uniform distribution
splinefun()              # Perform spline interpolation
smooth.spline()          # Fits a cubic smoothing spline
sd()                     # Calculate standard deviation
summary(x)               # Returns a summary of x: mean, min, max etc.
t.test()                 # Student's t-test
var()                    # Calculate variance
sample()                 # Random samples & permutations
ecdf()                   # Empirical Cumulative Distribution Function
qqplot()                 # quantile-quantile plot
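A few of these functions applied to a small numeric vector give a feel for how they are
called. The data values below are made up for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

print(mean(x))
print(median(x))
print(range(x))     # minimum and maximum
print(var(x))       # sample variance (divides by n - 1)
print(sd(x))        # square root of the sample variance
print(cumsum(x))    # running total
print(summary(x))
```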
2.9.1 Does R have Pointers?
No. Just as in Java, there are no pointers in R.
R does not have variables corresponding to pointers or references like those of, say, the C
language. This can make programming more difficult in some cases. (Reference classes,
introduced in Section 2.6, may reduce the difficulty.)
For instance, you cannot write a function that directly changes its arguments.
In Python, for example, you can do this:
>>> x = [13,5,12]
>>> x.sort()
>>> x
[5, 12, 13]
Here, the value of x, the argument to sort(), changed. By contrast, here is how it
works in R:
> x <- c(13,5,12)
> sort(x)
[1]  5 12 13
> x
[1] 13  5 12
R is a statistical analysis package based on writing short scripts or programs (as
opposed to being based on GUIs like spreadsheets or directed workflow editors).
R is not a "conventional" programming language. In each script, variables provide the
means of accessing the data stored in memory. R does not give direct access to the
computer's memory but rather provides a number of specialized data structures (objects).
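The usual R idiom is therefore to return the changed value and reassign it. The sketch
below shows both halves of this: a function that changes only its local copy, and a
reassignment that updates the caller's variable. The helper add_one is hypothetical.

```r
# Hypothetical helper: changes made to v stay inside the function.
add_one <- function(v) {
  v[1] <- v[1] + 1   # modifies the local copy only
  v
}

x <- c(13, 5, 12)
y <- add_one(x)

print(x)   # the caller's x is unchanged
print(y)   # the modified copy

# To update x itself, assign the result back to it.
x <- sort(x)
print(x)
```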
2.9.2 Recursion
R Recursion Function
A function that calls itself is known as a recursive function, and this technique is known
as recursion.
This special programming technique can be used to solve problems by breaking them
into smaller and simpler sub-problems.
An example can help clarify this concept.
Let us take the example of finding the factorial of a number. The factorial of a
positive integer is defined as the product of all the integers from 1 to that number.
For instance, the factorial of 5 (denoted as 5!) is
5! = 1*2*3*4*5 = 120
The problem of finding the factorial of 5 can be broken down into the sub-problem of
multiplying the factorial of 4 by 5.
5! = 5*4!
Or more generally,
n! = n*(n-1)!
Now we can continue this until we reach 0!, which is 1.
The implementation is given below as an example of recursion in R:
# Recursive function to find factorial
recursive.factorial <- function(x) {
  if (x == 0) {
    return(1)                               # terminating condition: 0! is 1
  } else {
    return(x * recursive.factorial(x - 1))  # n! = n * (n-1)!
  }
}
Here, we have a function which calls itself. A call such as recursive.factorial(x)
turns into x * recursive.factorial(x-1), and this repeats until x becomes equal to 0.
When x becomes 0, we return 1, since the factorial of 0 is 1.
This is the terminating condition and is very important.
Without it the recursion would not end and would continue indefinitely (in theory). Here
are some example calls to our function.
> recursive.factorial(0)
[1] 1
> recursive.factorial(5)
[1] 120
> recursive.factorial(7)
[1] 5040
Recursion often makes code shorter and cleaner. However, the logic of recursive code is
sometimes hard to follow, and it can be difficult to think about a problem recursively.
Recursive functions are also memory intensive, since they can result in a
large number of nested function calls. This must be kept in mind when using recursion
to solve large problems.
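To illustrate the memory point, the same factorial can be computed with a loop, which needs no nested function calls. This is only a sketch for comparison; the name iterative.factorial is made up here and is not part of the original example.

```r
# Iterative factorial: same result as recursive.factorial(), no nested calls.
# (iterative.factorial is a hypothetical name used for this illustration.)
iterative.factorial <- function(x) {
  result <- 1
  # seq_len(0) is empty, so for x = 0 the loop is skipped and 1 is returned
  for (i in seq_len(x)) {
    result <- result * i
  }
  result
}

iterative.factorial(5)
```

For small inputs both versions behave the same; the loop simply avoids building a deep call stack.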
2.10 Use of c and cbind in R
R has several built-in functions for combining objects, namely c, cbind, and rbind.
Let's take a closer look at them.
The c Function
Combine Values into a Vector or List
c is a generic function that combines its arguments.
The default method combines its arguments to form a vector. All arguments are coerced
to a common type, which is the type of the returned value, and all attributes except names
are removed.
cbind and rbind: Combine R Objects by Rows or Columns
Description
Take a sequence of vector, matrix or data frame arguments and combine them by
columns (cbind) or rows (rbind), respectively. These are generic functions with methods for
other R classes.
Usage
cbind(..., deparse.level = 1)
rbind(..., deparse.level = 1)
## S3 method for class 'data.frame'
rbind(..., deparse.level = 1, make.row.names = TRUE,
      stringsAsFactors = default.stringsAsFactors())
Arguments
...
    (generalized) vectors or matrices. These can be given as named arguments.
    Other R objects may be coerced as appropriate, or S4 methods may be
    used: see the sections 'Details' and 'Value' of the R help page ?cbind.
    (For the "data.frame" method of cbind these can be further arguments to
    data.frame, such as stringsAsFactors.)
deparse.level
    integer controlling the construction of labels in the case of non-matrix-like
    arguments (for the default method):
    deparse.level = 0 constructs no labels; the default,
    deparse.level = 1 or 2, constructs labels from the argument names.
make.row.names
    (only for the data frame method:) logical indicating if unique and
    valid row.names should be constructed from the arguments.
stringsAsFactors
    logical, passed to as.data.frame; only has an effect when the ... arguments
    contain a (non-data.frame) character.
Details
The functions cbind and rbind are S3 generic, with methods for data frames. The data
frame method will be used if at least one argument is a data frame and the rest are
vectors or matrices. There can be other methods; in particular, there is one for time
series objects. See the section on 'Dispatch' of the help page for how the method to be
used is selected. If some of the arguments are of an S4 class, i.e., isS4(.) is true, S4
methods are sought as well, and the hidden cbind/rbind functions from package methods
may be called, which in turn build on cbind2 or rbind2, respectively. In that case,
deparse.level is obeyed, similarly to the default method.
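A short sketch may make the difference between the three functions concrete. The vectors a and b and the other variable names are invented for this illustration.

```r
# c() combines values into a vector; mixed types are coerced to a common type
v <- c(1, 2, 3)
mixed <- c(1, "a", TRUE)   # everything becomes character

# cbind() places the vectors side by side as columns,
# rbind() stacks them as rows
a <- c(1, 2, 3)
b <- c(4, 5, 6)
by_col <- cbind(a, b)      # 3 x 2 matrix with column names "a" and "b"
by_row <- rbind(a, b)      # 2 x 3 matrix with row names "a" and "b"

dim(by_col)
dim(by_row)
```

Note that the labels "a" and "b" come from the argument names, which is the behavior controlled by deparse.level.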
2.10.1 Rbind/ rbind Function
The rbind() function combines vectors, matrices or data frames by rows.
rbind(x1, x2, ...)
x1, x2: vectors, matrices or data frames
dataset1.csv
Subtype  Gender  Expression
A        m       -0.54
A        f       -0.8
B        f       -1.03
C        m       -0.41
dataset2.csv
Subtype  Gender  Expression
D        m       3.22
D        f       1.02
D        f       0.21
D        m       -0.04
D        m       2.11
B        m       -1.21
A        f       -0.2
Read in the data from the files and combine them:
> x <- read.csv("dataset1.csv", header=TRUE, sep=",")
> x2 <- read.csv("dataset2.csv", header=TRUE, sep=",")
> x3 <- rbind(x, x2)
> x3
   Subtype Gender Expression
1        A      m      -0.54
2        A      f      -0.80
3        B      f      -1.03
4        C      m      -0.41
5        D      m       3.22
6        D      f       1.02
7        D      f       0.21
8        D      m      -0.04
9        D      m       2.11
10       B      m      -1.21
11       A      f      -0.20
The columns of the two datasets must be the same; otherwise the combination is
meaningless (for data frames, rbind() stops with an error if the column names do not match).
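The same idea can be tried without any files. The sketch below builds two small data frames by hand (the values are borrowed from the tables above, the data frame named bad is invented) and also shows what happens when the column names differ.

```r
x  <- data.frame(Subtype = c("A", "B"), Gender = c("m", "f"),
                 Expression = c(-0.54, -1.03))
x2 <- data.frame(Subtype = c("D", "D"), Gender = c("m", "f"),
                 Expression = c(3.22, 1.02))

# Same column names: the rows of x2 are appended below those of x
x3 <- rbind(x, x2)
nrow(x3)

# Different column names: rbind() fails; try() keeps the script running
bad <- data.frame(Type = "C", Gender = "m", Expression = -0.41)
result <- try(rbind(x, bad), silent = TRUE)
inherits(result, "try-error")
```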
2.10.2 R attach() and detach() Functions
The database is attached to the R search path. This means that the database is searched by R
when evaluating a variable, so objects in the database can be accessed by simply giving their
names.
The attach() function makes the data available on the R search path.
attach(x)
x: data frame, list or environment
The following file is used for an ANOVA analysis:
Subtype,Gender,Expression
A,m,-0.54
A,m,-0.8
A,m,-1.03
A,m,-0.41
A,m,-1.31
A,f,-0.66
A,m,-0.43
A,m,1.01
A,f,-1.15
Let us first read in the data from the file:
> x <- read.csv("anova.csv", header=TRUE, sep=",")
There are three variables in the header, namely Subtype, Gender and Expression. We can
display a variable with:
>x$Gender
[1] m m m m m f m m f m m f m m m m f m m m m m m f m m m f m m m m f m m m m
[38] m m m m m m m m m f m f m m m m m f m m f m m f m m m m f m m m m m m m m
[75] m m f m m m m m f m m m m m m m m m f m m f m m f m f m m f m m f m m f m
[112] m f m m f m m m f m m m f m f m f f f f f f m f m f f f m f f f f m f m f
[149] m f f m f f f f f m f m f f m f f m f f m f f f m f f f m f f f m f f m f
[186] f f m f f m f m m f m f m f f m f f f f f m f f m f f f m m m f m m m f f
[223] f f f f f m m m f m f f m f f f m f f f m f f f f m f m f f f f m f f f m
[260] f f m f f f f f f m f f m f f f f f f m f f
Levels: f m
Before attaching, the variable Gender is not on the R search path:
> Gender
Error: object 'Gender' not found
After we apply the attach() function to the object x, Gender can be accessed directly by name:
>attach(x)
>Gender
[1] m m m m m f m m f m m f m m m m f m m m m m m f m m m f m m m m f m m m m
[38] m m m m m m m m m f m f m m m m m f m m f m m f m m m m f m m m m m m m m
[75] m m f m m m m m f m m m m m m m m m f m m f m m f m f m m f m m f m m f m
[112] m f m m f m m m f m m m f m f m f f f f f f m f m f f f m f f f f m f m f
[149] m f f m f f f f f m f m f f m f f m f f m f f f m f f f m f f f m f f m f
[186] f f m f f m f m m f m f m f f m f f f f f m f f m f f f m m m f m m m f f
[223] f f f f f m m m f m f f m f f f m f f f m f f f f m f m f f f f m f f f m
[260] f f m f f f f f f m f f m f f f f f f m f f
Levels: f m
detach() function reverses the process:
>detach(x)
>Gender
Error: object 'Gender' not found
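The full cycle can be sketched on a small hand-made data frame (the three rows below are invented for the illustration). The with() call at the end is a common alternative that gives the same access without modifying the search path.

```r
x <- data.frame(Subtype = c("A", "A", "B"),
                Gender = c("m", "f", "m"),
                Expression = c(-0.54, -0.80, -1.03))

attach(x)
mean_attached <- mean(Expression)   # Expression is found via the search path
detach(x)

# with() evaluates the expression inside x without changing the search path
mean_with <- with(x, mean(Expression))
```

Both approaches give the same mean; with() has the advantage that nothing needs to be detached afterwards.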
Factors in R
Tell R that a variable is nominal by declaring it as a factor. The factor stores the
nominal values as a vector of integers in the range [ 1... k ] (where k is the number of
unique values in the nominal variable), and an internal vector of character strings (the
original values) mapped to these integers.
# variable gender with 20 "male" entries and
# 30 "female" entries
gender <- c(rep("male",20), rep("female", 30))
gender <- factor(gender)
# stores gender as 20 2s and 30 1s and associates
# 1=female, 2=male internally (alphabetically)
# R now treats gender as a nominal variable
summary(gender)
A factor variable in R is a vector of categorical data. The factor() function creates a factor
variable and determines the distinct categories (levels) of a vector.
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x))
x: a vector of data
...
> v <- c(1,3,5,8,2,1,3,5,3,5)
> is.factor(v)
[1] FALSE
Converting v to a factor identifies its distinct levels:
> factor(v)
[1] 1 3 5 8 2 1 3 5 3 5
Levels: 1 2 3 5 8
> x <- factor(v)
>x
[1] 1 3 5 8 2 1 3 5 3 5
Levels: 1 2 3 5 8
> is.factor(x)
[1] TRUE
Select levels:
> x <- factor(v, levels=c(2,1))
>x
[1] 1 <NA> <NA> <NA> 2 1 <NA> <NA> <NA> <NA>
Levels: 2 1
Change the level value:
> levels(x) <- c("two","one")
>x
[1] one <NA> <NA> <NA> two one <NA> <NA> <NA> <NA>
Levels: two one
Transforming Factors
Converting a factor to a number can cause problems:
> f <- factor(c(3.4, 1.2, 5))
> as.numeric(f)
[1] 2 1 3
as.numeric() returns the internal integer codes rather than the original values, so this
does not behave as expected (and there is no warning). The
recommended way is to use the integer vector to index the factor levels.
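The recommended conversion can be sketched as follows; the variable names codes and values are chosen for this illustration only.

```r
f <- factor(c(3.4, 1.2, 5))

codes  <- as.numeric(f)              # internal integer codes: 2 1 3
values <- as.numeric(levels(f))[f]   # original values: 3.4 1.2 5.0
```

Here levels(f) is converted to numbers first, and the factor itself is then used as an index into that vector.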
2.11 Chapter Summary
In this chapter we have studied the introduction to R programming in detail. R is free and
open source, making it possible for anyone to access world-class statistical analysis tools.
It is used widely in academia and the private sector and is one of the most popular
statistical analysis programming languages today. Learning R is not easy; if it were, data
scientists would not be in such high demand. However, there is no shortage of quality
resources you can use to learn R if you are willing to invest the time and effort.
It is clear from the above that R is a popular and strong choice: it supports several styles
of programming, it is open source, and it can interface with other languages.
2.12 Online Resources
1. https://data-flair.training/blogs/r-tutorial/
   R tutorial | Introduction to R Programming – Features and Applications
2. http://cran.r-project.org/doc/FAQ/R-FAQ.html
   The CRAN Web site hosts several documents, bibliographic resources, and links to other sites.
3. http://www.R-project.org/mail.html
   There are four discussion lists on R; use this page to subscribe, send a message, or read the archives.
4. https://www.rstudio.com/online-learning/
   A wealth of tutorials, articles, and examples exist to help you learn R and its extensions.
2.13 Exercises
• Explain how you can start the R Commander GUI.
• What is procedural programming in R?
• What are statistical software and data analysis in R?
• List the steps involved in an analytics project in R.
• Explain mean, median and mode in R. How do you perform mean, median and mode
operations on statistical data in R?
• What is the recycling of elements in a vector? Give an example.
• Explain the following in terms of R:
  i) table
  ii) file
  iii) tree
• What are vectors in R? Construct the following vectors in R:
  a) $(0.1^{3}\,0.2^{1},\ 0.1^{6}\,0.2^{4},\ \ldots,\ 0.1^{36}\,0.2^{34})$
  b) $\left(2,\ \frac{2^{2}}{2},\ \frac{2^{3}}{3},\ \ldots,\ \frac{2^{25}}{25}\right)$
• What are the different data structures in R? Briefly explain each of the following:
  a) Vector
  b) List
  c) Matrix
  d) Data frames
  e) Arrays
• Consider the following matrix creation code. State whether the comment in it is True or False.
> matrix(1:9, nrow = 3, ncol = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> # same result is obtained by providing only one dimension
> matrix(1:9, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
• Write code to access the elements of the matrix defined in the code above.
• Match the following R functions with their appropriate uses.
  Function          Performs the following operation
  i)   append()     a) Combine vectors by row/column
  ii)  cbind()      b) Set or query graphical parameters
  iii) example()    c) Plot matrix of scatter plots
  iv)  pairs()      d) Add elements to a vector
  v)   par()        e) Gives a demo()
• What are the different types of sorting algorithms available in the R language?
CHAPTER 03
Data Manipulation in R
LEARNING OBJECTIVES
3.1 String Manipulation
String manipulation comprises a set of functions used to extract information from text. In
R, packages such as stringr and stringi provide string manipulation functions.
Let us explore some functions from the stringr package, together with the base R function strwrap.
strwrap: This base R function wraps a string into lines of limited width to format paragraphs (the stringr counterpart is str_wrap).
library(stringr)
# takes a string in variable string
string <- "Los Angeles, officially the City of Los Angeles and often known by its initials L.A.,
is the second-most populous city in the United States (after New York City), the most
populous city in California and the county seat of Los Angeles County. Situated in
Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity,
sprawling metropolis, and as a major center of the American entertainment industry."
strwrap(string)
[1] "Los Angeles, officially the City of Los Angeles and often known by its initials"
[2] "L.A., is the second-most populous city in the United States (after New York City),"
[3] "the most populous city in California and the county seat of Los Angeles County."
[4] "Situated in Southern California, Los Angeles is known for its Mediterranean"
[5] "climate, ethnic diversity, sprawling metropolis, and as a major center of the"
[6] "American entertainment industry."
str_length:
It is used to count the number of characters in the string.
str_length(string)
[1] 429
str_replace:
It is used to replace the first matched pattern in a string (str_replace_all replaces every match).
str_replace(string, pattern, replacement)
# string is the input vector
# pattern specifies the pattern to look for
# replacement specifies a character vector of replacements
str_sub:
This function is used to extract substrings from a character vector.
str_sub(string, start, end)
# string is the input character vector
# start gives the index of the first character to be extracted
# end gives the index of the last character to be extracted
str_split:
This function is used to split a vector of strings at a delimiter.
str_split(string, pattern)
# string is the input vector of strings
# pattern specifies the delimiter
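For readers who do not have stringr installed, the same four operations can be sketched with base R functions. This is a substitute technique, not the stringr API; the sample string is invented for the illustration.

```r
s <- "Los Angeles, California"

n_chars  <- nchar(s)                        # counterpart of str_length(s)
replaced <- sub("Los Angeles", "L.A.", s)   # like str_replace(): first match only
piece    <- substr(s, 1, 3)                 # counterpart of str_sub(s, 1, 3)
parts    <- strsplit(s, ", ")[[1]]          # counterpart of str_split(s, ", ")[[1]]
```

strsplit() returns a list with one element per input string, which is why the first element is extracted with [[1]].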
3.2 Data sorting
To sort a data frame in R, the order() function can be used. By default, order()
sorts in ASCENDING order. Prepending a minus sign (-) to a numeric sorting
variable indicates DESCENDING order. Here are some examples.
# sorting examples using the "temp" dataset
attach(temp)
# sort by id
newdata <- temp[order(id), ]
# sort by id and name
newdata <- temp[order(id, name), ]
# sort by id (ascending) and name (descending);
# name is a character column, so its ranks from xtfrm() are negated
newdata <- temp[order(id, -xtfrm(name)), ]
detach(temp)
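The sorting calls can be tried on a small stand-in for the "temp" dataset. The four rows below are invented; only the column names id and name follow the text.

```r
temp <- data.frame(id = c(2, 1, 2, 1),
                   name = c("b", "a", "c", "d"),
                   stringsAsFactors = FALSE)

# sort by id, then by name (both ascending)
by_id_name <- temp[order(temp$id, temp$name), ]

# sort by id ascending and name descending;
# xtfrm() turns the character column into ranks, which can be negated
by_id_name_desc <- temp[order(temp$id, -xtfrm(temp$name)), ]
```

Referring to the columns as temp$id and temp$name avoids the need for attach() and detach().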
3.3 Dealing with missing values
We may come across situations in which the data is incomplete. In R,
missing values are represented as NA. To handle incomplete
data we use functions like na.omit() and complete.cases().
These functions help select the rows of a data frame that are free from missing
values.
complete.cases(data)
# returns a logical vector: TRUE for each row that is complete
na.omit(data)
# returns the rows free from missing values
To check which individual values are missing, we use the function is.na().
It returns a logical value, TRUE or FALSE, for each element.
is.na(x)
# returns TRUE where a value is NA and FALSE otherwise
# here x is the object we want to check
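The three functions can be compared on a small data frame with two missing scores. The data frame below is invented for this sketch.

```r
data <- data.frame(id = 1:4, score = c(10, NA, 30, NA))

missing   <- is.na(data$score)          # TRUE where the score is missing
complete  <- complete.cases(data)       # TRUE for rows without any NA
clean     <- na.omit(data)              # rows 1 and 3 only
clean_alt <- data[complete.cases(data), ]   # same rows via logical indexing
```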
3.4 Find and remove duplicate records
R provides two base functions to find and remove duplicates:
• duplicated(): To identify duplicate elements.
• unique(): To extract unique elements.
Example:
Given the following vector:
x <- c(2, 1, 5, 5, 4, 6)
To find which elements of x are duplicates of an earlier element, use this:
duplicated(x)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE
Extract duplicate elements:
x[duplicated(x)]
## [1] 5
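Removing the duplicates is the complementary step. A minimal sketch, reusing the vector x from the example and adding an invented data frame df to show that the same idea works on rows:

```r
x <- c(2, 1, 5, 5, 4, 6)

no_dups     <- x[!duplicated(x)]   # keep only first occurrences
no_dups_alt <- unique(x)           # same result

# duplicated() also works row-wise on data frames
df <- data.frame(id = c(1, 1, 2), value = c("a", "a", "b"))
df_unique <- df[!duplicated(df), ]
```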
3.5 Cleaning data
R provides many useful commands to clean data. Some of them are listed
below:
● sub(): replaces the first match of a pattern in each string.
● gsub(): replaces all matches of a pattern in each string.
2.13.1.1 Quantitative Variables in Ranges
● cut(data$col, seq(0, 100, by=10)):
Breaks the data into the ranges its values fall into.
● cut2(data$col, g=6) (from the Hmisc package):
It returns a factor variable with 6 groups.
● cut2(data$col, m=25):
It returns a factor variable with at least 25 observations in each group.
2.13.1.2 Manipulating Rows/Columns
● merge():
For combining data frames.
● sort():
It sorts a vector.
● order(data$col, na.last=TRUE):
It returns the row indexes in sorted order.
● data[order(data$col, na.last=TRUE),]:
It reorders the entire data frame based on a column.
● melt():
In the reshape2 package, it is used for reshaping data.
● rbind():
It adds more rows to a data frame.
3.6 Recoding data
In R, it is possible to re-code an entire vector or array at once. For example,
let's create a vector that has missing values.
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A
[1] 3 2 NA 5 3 7 NA NA 5 2 6
Some re-coding tasks are more complex, for example re-coding a categorical
variable. In such cases, you might want to re-code a vector or an array from
character strings to numeric values.
gender <- c("MALE","FEMALE","FEMALE","UNKNOWN","MALE")
gender
[1] "MALE" "FEMALE" "FEMALE" "UNKNOWN" "MALE"
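The two vectors above can be re-coded in one step each. The sketch below is one possible approach; the replacement value 0 and the coding scheme MALE = 1, FEMALE = 2 are assumptions made for the illustration, not something the text prescribes.

```r
A <- c(3, 2, NA, 5, 3, 7, NA, NA, 5, 2, 6)
A[is.na(A)] <- 0          # re-code every missing value to 0 at once

gender <- c("MALE", "FEMALE", "FEMALE", "UNKNOWN", "MALE")
# hypothetical coding scheme: MALE = 1, FEMALE = 2, anything else becomes NA
codes <- c(MALE = 1, FEMALE = 2)
gender_num <- unname(codes[gender])
```

Looking values up in a named vector re-codes the whole vector without a loop; labels that are not in the lookup table, such as "UNKNOWN", come back as NA.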
3.7 Merging Data Frames
2.14 Adding Columns
To merge two data frames horizontally, the merge() function is used in R. In
most cases, we join two data frames by one or more common key variables.
# merge two data frames by ID
total <- merge(data_frame_A, data_frame_B, by="ID")
# merge two data frames by ID and Country
total <- merge(data_frame_A, data_frame_B, by=c("ID","Country"))
2.15 Adding Rows
To join two data frames vertically, the rbind() function is used. The two
data frames must have the same variables, but they do not have to be in the same order.
total <- rbind(data_frame_A, data_frame_B)
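A small worked example may help to see what merge() keeps. The contents of the two data frames are invented; all = TRUE is a standard merge() argument that keeps unmatched rows from both sides.

```r
data_frame_A <- data.frame(ID = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
data_frame_B <- data.frame(ID = c(2, 3, 4), score = c(80, 90, 70))

# default: only IDs present in both data frames are kept
total <- merge(data_frame_A, data_frame_B, by = "ID")

# all = TRUE keeps every ID and fills the gaps with NA
total_all <- merge(data_frame_A, data_frame_B, by = "ID", all = TRUE)
```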
3.8 Slicing of Data
R has powerful indexing features for accessing object elements. The
following code snippets demonstrate ways to keep or delete variables and
observations and to take random samples from a dataset.
2.16 Selecting (Keeping) Variables
# select variables v1, v2, v3
vars <- c("v1", "v2", "v3")
newdata <- data[vars]
# another method
vars <- paste("v", 1:3, sep="")
newdata <- data[vars]
# select 1st and 5th through 10th variables
newdata <- data[c(1,5:10)]
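Deleting variables, keeping observations and drawing a random sample follow the same indexing pattern. A minimal sketch on an invented data frame (the condition v1 > 2 and the sample size 3 are arbitrary choices):

```r
data <- data.frame(v1 = 1:6, v2 = 7:12, v3 = 13:18)

# delete variable v3 by excluding its name
dropped <- data[, !(names(data) %in% c("v3"))]

# keep only the observations with v1 > 2
kept_rows <- data[data$v1 > 2, ]

# random sample of 3 rows, without replacement
set.seed(1)   # makes the sample reproducible
sampled <- data[sample(nrow(data), 3), ]
```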
3.9 Renaming Columns and Rows
2.17 Renaming Columns
To rename a column in a data frame, the colnames() function is used.
# changes column name "old.name" to "new.name"
colnames(data)[colnames(data)=="old.name"] <- "new.name"
colnames() can also be used to remove column names.
# we remove column names by making them NULL
colnames(data) <- NULL
2.19 Renaming Rows
To rename a row in a data frame, the rownames() function is used.
# changes row name "old.name" to "new.name"
rownames(data)[rownames(data)=="old.name"] <- "new.name"
rownames() can also be used to remove row names.
# we remove row names by making them NULL
rownames(data) <- NULL
3.10 Adding and Replacing Columns to Data Frames
To add a new column to a data frame we use the $ operator:
my.dataframe$new.col <- a.vector
# my.dataframe is the name of the data frame
# new.col is the name of the new column
# a.vector is the input vector for the new column
Alternatively, we can add a column in the following way:
my.dataframe["new.col"] <- a.vector
To replace a column, follow these steps:
my.dataframe$prev.col <- NULL
my.dataframe$new.col <- a.vector
Or, if we don't want to change the column name:
my.dataframe$prev.col <- a.vector
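Renaming, adding and removing columns can be combined in a few lines. The data frame and the column name double are invented for this sketch.

```r
my.dataframe <- data.frame(id = 1:3, old.name = c(10, 20, 30))

# rename column "old.name" to "new.name"
colnames(my.dataframe)[colnames(my.dataframe) == "old.name"] <- "new.name"

# add a column computed from an existing one
my.dataframe$double <- my.dataframe$new.name * 2

# remove a column by assigning NULL
my.dataframe$new.name <- NULL
```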
3.11 Apply functions
Apply
apply can be used to apply a function to the rows or columns of a matrix. For example,
let's create a sample dataset:
data <- matrix(c(1:10, 21:30), nrow = 5, ncol = 4)
data
     [,1] [,2] [,3] [,4]
[1,]    1    6   21   26
[2,]    2    7   22   27
[3,]    3    8   23   28
[4,]    4    9   24   29
[5,]    5   10   25   30
Now the apply() function can be used to find the mean of each row as follows:
apply(data, 1, mean)
[1] 13.5 14.5 15.5 16.5 17.5
The second parameter is the dimension. Here, 1 signifies rows and 2 signifies
columns. If you want the function applied to every individual element, c(1, 2) can be used.
lapply
lapply is similar to apply, but it takes a list or vector as input and always
returns a list as output.
Let's create a list:
data <- list(x = 1:5, y = 6:10, z = 20:25)
data
$x
[1] 1 2 3 4 5
$y
[1]  6  7  8  9 10
$z
[1] 20 21 22 23 24 25
Now, lapply can be used to apply a function to each element in the list. For
example:
lapply(data, FUN=median)
$x
[1] 3
$y
[1] 8
$z
[1] 22.5
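As a complement, the sketch below applies a function over the columns instead of the rows and flattens an lapply() result into a plain vector. The names m, lst, col_means and medians are chosen for this illustration.

```r
m <- matrix(c(1:10, 21:30), nrow = 5, ncol = 4)
col_means <- apply(m, 2, mean)           # 2 = columns: one mean per column

lst <- list(x = 1:5, y = 6:10, z = 20:25)
medians <- unlist(lapply(lst, median))   # flatten the list into a named vector
```

unlist() is convenient when every list element is a single number, as it is here.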
3.10 Doing Math and Simulation in R, Math Function
In R, more than just basic operators can be used. R comes with a huge set of
mathematical functions. All these functions are vectorized, so you can use
them on complete vectors.
Function         What It Does
abs(x)           Takes the absolute value of x
log(x,base=y)    Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm
exp(x)           Returns the exponential of x
sqrt(x)          Returns the square root of x
factorial(x)     Returns the factorial of x (x!)
choose(x,y)      Returns the number of possible combinations when drawing y elements at a time from x possibilities
In R, the natural logarithm of the numbers from 1 to 3 can be calculated like this:
> log(1:3)
[1] 0.0000000 0.6931472 1.0986123
We can also calculate the logarithm of these numbers with any base. The logarithm with base 6
is given below:
> log(1:3,base=6)
[1] 0.0000000 0.3868528 0.6131472
For ordinary (real) numeric input, sqrt() produces NaN, with a warning, when a negative number
is given:
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
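A few of the functions from the table can be tried directly. The last line is an addition that goes slightly beyond the text: converting the input with as.complex() lets sqrt() return a complex result instead of NaN.

```r
n_comb  <- choose(5, 2)           # ways to draw 2 elements from 5
fact    <- factorial(5)           # 5! = 120
abs_val <- abs(c(-2, 3))          # vectorized: works on the whole vector
root    <- sqrt(as.complex(-1))   # complex input gives 0+1i instead of NaN
```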
3.12 Functions For Statistical Distribution
Normal distribution
There are four functions that can be used to generate values associated
with the normal distribution. We can get a full list of them and their options
using the help command:
> help(Normal)
The first function is dnorm(). It returns the height of the probability
distribution at each point of given values. If you only give the points it
assumes you want to use a mean of zero and standard deviation of one.
There are options to use different values for mean and standard deviation,
though:
> dnorm(0)
[1] 0.3989423
> dnorm(0)*sqrt(2*pi)
[1] 1
> dnorm(0, mean=4)
[1] 0.0001338302
> dnorm(0, mean=4, sd=10)
[1] 0.03682701
> v <- c(0, 1, 2)
> dnorm(v)
[1] 0.39894228 0.24197072 0.05399097
> x <- seq(-20, 20, by=.1)
> y <- dnorm(x)
> plot(x, y)
> y <- dnorm(x, mean=2.5, sd=0.1)
> plot(x, y)
The second function is pnorm(). Given a number or a list of numbers, it computes the
probability that a normally distributed random number will be less than that number. This
function also goes by the rather ominous title of the "Cumulative
Distribution Function." It accepts the same options as dnorm:
> pnorm(0)
[1] 0.5
> pnorm(1)
[1] 0.8413447
> pnorm(0, mean=2)
[1] 0.02275013
> pnorm(0, mean=2, sd=3)
[1] 0.2524925
> v <- c(0, 1, 2)
> pnorm(v)
[1] 0.5000000 0.8413447 0.9772499
> x <- seq(-20, 20, by=.1)
> y <- pnorm(x)
> plot(x, y)
> y <- pnorm(x, mean=3, sd=4)
> plot(x, y)
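The remaining two of the four functions are qnorm() and rnorm(). A brief sketch of how they relate to pnorm(); the probability 0.975, the seed and the sample size are arbitrary choices for the illustration.

```r
# qnorm() is the inverse of pnorm(): it returns the value below which
# a given share of the distribution lies
q975 <- qnorm(0.975)
p    <- pnorm(q975)     # gives back 0.975

# rnorm() draws random numbers; it accepts the same mean and sd options
set.seed(42)            # makes the draws reproducible
draws <- rnorm(1000, mean = 3, sd = 4)
```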
3.13 Sorting, Linear Algebra Operation on Vectors and Matrices
2.20 Matrix facilities
In the following examples, A and B are matrices and x and b are vectors.
Operator or Function    Description
A * B                   Element-wise multiplication
A %*% B                 Matrix multiplication
A %o% B                 Outer product. AB'
crossprod(A,B)          A'B
crossprod(A)            A'A
t(A)                    Transpose
diag(x)                 Creates diagonal matrix with elements of x in the principal diagonal
diag(A)                 Returns a vector containing the elements of the principal diagonal
diag(k)                 If k is a scalar, this creates a k x k identity matrix. Go figure.
solve(A, b)             Returns vector x in the equation b = Ax (i.e., $A^{-1}b$)
solve(A)                Inverse of A where A is a square matrix.
ginv(A)                 Moore-Penrose Generalized Inverse of A.
                        ginv(A) requires loading the MASS package.
y <- eigen(A)           y$val are the eigenvalues of A
                        y$vec are the eigenvectors of A
y <- svd(A)             Singular value decomposition of A.
                        y$d = vector containing the singular values of A
                        y$u = matrix whose columns contain the left singular vectors of A
                        y$v = matrix whose columns contain the right singular vectors of A
R <- chol(A)            Choleski factorization of A. Returns the upper triangular factor,
                        such that R'R = A.
y <- qr(A)              QR decomposition of A.
                        y$qr has an upper triangle that contains the decomposition and a
                        lower triangle that contains information on the Q decomposition.
                        y$rank is the rank of A.
                        y$qraux is a vector which contains additional information on Q.
                        y$pivot contains information on the pivoting strategy used.
cbind(A,B,...)          Combine matrices (vectors) horizontally. Returns a matrix.
rbind(A,B,...)          Combine matrices (vectors) vertically. Returns a matrix.
rowMeans(A)             Returns vector of row means.
rowSums(A)              Returns vector of row sums.
colMeans(A)             Returns vector of column means.
colSums(A)              Returns vector of column sums.
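A short sketch shows a few of these facilities working together on a 2 x 2 system. The matrix A and the vector b are invented for the illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # columns are (2, 1) and (1, 3)
b <- c(3, 5)

x <- solve(A, b)                # solves A %*% x = b
A_inv <- solve(A)               # inverse of A
identity_check <- A %*% A_inv   # should be the 2 x 2 identity matrix
```

Multiplying A by the solution x reproduces b, which is a quick way to check the result.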
CHAPTER 04
Data Import Techniques in R
LEARNING OBJECTIVES
4.1 Installing a Package
To install a package, the install.packages() function is used.
install.packages("Package Name")
4.2 Activating a Package
Activating a package is very easy in R. There are two ways to activate a
package. The first method is to activate the package directly from the Packages menu.
The second method is to use the library() function.
library("Package Name")
To open the documentation of a particular package, use the help argument of
library():
library(help = "Package Name")
4.3 Built-In datasets
To list the built-in datasets of a package we use the data() function:
data(package = "package.name")
# package.name is the name of the package whose datasets we want to list
4.4 Reading Files
There are many standards for storing data. Common formats are CSV,
JSON, XML, YAML etc.
Importing data into R is simple. For Stata and Systat files, the foreign package is
used. Examples of importing data are provided below.
From A Comma Delimited Text File
To import data from a CSV file, the read.table and read.csv functions are used.
# header=TRUE: the first row contains variable names
# row.names="id": assign the variable id to row names
mydata <- read.table("data.csv", header=TRUE, sep=",",
row.names="id")
There is another way of reading a CSV file.
mydata <- read.csv(file = "data.csv")
From An Excel File
To import data from an Excel file we use the read.xlsx function.
library(xlsx)
mydata <- read.xlsx(file, sheetIndex, header=TRUE)
# file specifies the file path
# sheetIndex specifies the index of the sheet to be read
# header: if TRUE, the first row is used as column names
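To try the CSV import without an existing file, one can write a small data frame to a temporary file and read it back. The file location and the data frame contents below are assumptions made for this sketch.

```r
# temporary file used only for this demonstration
csv_path <- tempfile(fileext = ".csv")

original <- data.frame(id = c("a", "b"), value = c(1.5, 2.5))
write.csv(original, csv_path, row.names = FALSE)

# read the file back, using the id column as row names
mydata <- read.csv(csv_path, header = TRUE, row.names = "id")

unlink(csv_path)   # remove the temporary file
```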
4.5 Writing Data
There are numerous methods for exporting R objects into other
formats. For SPSS, SAS and Stata, you will need to load the foreign package. For
Excel, the xlsx package can be used.
2.21 To a Tab Delimited Text File
write.table(mydata, "data.txt", sep="\t")
2.22 To an Excel Spreadsheet
library(xlsx)
write.xlsx(mydata, "data.xlsx")
4.6 Basic SQL queries in R
There are numerous ways to query data with R. This section shows you three
of the most common ways:
1. Using DBI
2. Using dplyr syntax
3. Using R Notebooks
Several recent packages make it easier to use databases within R. The query
examples below demonstrate the capabilities of these R packages.
● DBI: The DBI specification has gone through many recent
improvements. When working with databases, one should always
use packages that are DBI-compliant.
● dplyr & dbplyr: The dplyr package has a generalized SQL backend for
talking to databases, and the dbplyr package translates R code
into database-specific variants. SQL variants are supported for
the following databases: Microsoft SQL Server, Oracle,
PostgreSQL, Apache Hive, Amazon Redshift, and Apache
Impala.
● odbc: The odbc R package provides a way to connect to any
database as long as you have an ODBC driver installed. The odbc
R package is DBI-compliant, and is recommended for ODBC
connections.
Example: Query bank data in an Oracle database
In this example, we will query bank data in an Oracle database. We connect
to the database by using the DBI and odbc packages. This specific connection
requires a database driver and a data source name (DSN) that have both
been configured by the system administrator. Your connection might use
another method.
library(DBI)
library(dplyr)
library(dbplyr)
library(odbc)
con <- dbConnect(odbc::odbc(), "Oracle DB")
2.22.1 Query using DBI
You can query the data with DBI by using the dbGetQuery() function. Simply
write SQL code into the R function as a quoted string.
dbGetQuery(con,'select "month_idx", "year", "month",
sum(case when "term_deposit" = \'yes\' then 1.0
else 0.0 end) as subscribe,
count(*) as total from "bank"
group by "month_idx", "year", "month"')
4.7 Web Scraping
Create a Scraping Function
First, you will need to load all the libraries for this task.
# General-purpose data wrangling
library(tidyverse)
# Parsing of HTML/XML files
library(rvest)
# String manipulation
library(stringr)
# Verbose regular expressions
library(rebus)
# Eases DateTime manipulation
library(lubridate)
Extract the Information of One Page
We want to extract the review text, rating, name of the author and time of
submission of all the reviews on a subpage.
For each of the data fields we write one extraction function using the tags.
get_reviews <- function(html){
  html %>%
    # The relevant tag
    html_nodes('.review-body') %>%
    html_text() %>%
    # Trim additional white space
    str_trim() %>%
    # Convert the list into a vector
    unlist()
}
get_reviewer_names <- function(html){
  html %>%
    html_nodes('.user-review-name-link') %>%
    html_text() %>%
    str_trim() %>%
    unlist()
}
The datetime information is a little trickier, as it is stored as an attribute. In
general, you look for the most broad description and then try to cut out all
redundant information. Because time information not only appears in the
reviews, you also have to extract the relevant status information and filter by
the correct entry.
get_review_dates <- function(html){
  status <- html %>%
    html_nodes('time') %>%
    # The status information is this time a tag attribute
    html_attrs() %>%
    # Extract the second element
    map(2) %>%
    unlist()
  dates <- html %>%
    html_nodes('time') %>%
    html_attrs() %>%
    map(1) %>%
    # Parse the string into a datetime object with lubridate
    ymd_hms() %>%
    unlist()
  # Combine the status and the date information
  # to filter one via the other
  return_dates <- tibble(status = status, dates = dates) %>%
    # Only these are actual reviews
    filter(status == 'ndate') %>%
    # Select and convert to vector
    pull(dates) %>%
    # Convert DateTimes to POSIX objects
    as.POSIXct(origin = '1970-01-01 00:00:00')
  # The lengths still occasionally do not align.
  # You then arbitrarily crop the dates to fit.
  # This can cause data imperfections, however
  # reviews on one page are generally close in time.
  length_reviews <- length(get_reviews(html))
  return_reviews <- if (length(return_dates) > length_reviews){
    return_dates[1:length_reviews]
  } else{
    return_dates
  }
  return_reviews
}
The last function we need is the extractor of the ratings. We will use regular
expressions for pattern matching.
get_star_rating <- function(html){
  # The pattern you look for: the first digit after `count-`
  pattern = 'count-' %R% capture(DIGIT)

  ratings <- html %>%
    html_nodes('.star-rating') %>%
    html_attrs() %>%
    # Apply the pattern match to all attributes
    map(str_match, pattern = pattern) %>%
    # str_match[1] is the fully matched string, the second entry
    # is the part you extract with the capture in your pattern
    map(2) %>%
    unlist()

  # Leave out the first instance, as it is not part of a review
  ratings[2:length(ratings)]
}
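To see what the capture group does, the same idea can be sketched in base R. The class strings below are hypothetical examples, not taken from a real page, and the pattern is written by hand rather than built with rebus.

```r
# Hypothetical class attributes, similar in shape to '.star-rating' nodes
classes <- c("star-rating count-5 size-medium",
             "star-rating count-2 size-medium",
             "star-rating count-4 size-medium")

# 'count-' followed by one captured digit
pattern <- "count-([0-9])"

# regexec() locates the full match and each capture group;
# regmatches() returns them as c(full match, group 1)
matches <- regmatches(classes, regexec(pattern, classes))

# Take the second entry of each result: the captured digit
ratings <- as.integer(vapply(matches, function(m) m[2], character(1)))
ratings
```

Here ratings holds 5, 2 and 4. As in the function above, the first entry of each match is the whole matched string and the second is the captured part.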
Once we have tested that the individual extractor functions work on a single
URL, we combine them to create a tibble, which is essentially a data frame,
for the whole page.
get_data_table <- function(html, company_name){
  # Extract the basic information from the HTML
  reviews <- get_reviews(html)
  reviewer_names <- get_reviewer_names(html)
  dates <- get_review_dates(html)
  ratings <- get_star_rating(html)

  # Combine into a tibble
  combined_data <- tibble(reviewer = reviewer_names,
                          date = dates,
                          rating = ratings,
                          review = reviews)

  # Tag the individual data with the company name
  combined_data %>%
    mutate(company = company_name) %>%
    select(company, reviewer, date, rating, review)
}
We wrap this function in a command that extracts the HTML from the URL,
which makes handling more convenient.
get_data_from_url <- function(url, company_name){
  html <- read_html(url)
  get_data_table(html, company_name)
}
In the last step, we apply this function to many URLs. To do this, we use
the map() function from the purrr package, which applies the same function to
each item of a list. Finally, we write one convenient function that takes the
URL of the landing page as input. It extracts all reviews and binds them into
one tibble, using map() to call get_data_from_url() on every page.
scrape_write_table <- function(url, company_name){
  # Read first page
  first_page <- read_html(url)

  # Extract the number of pages that have to be queried
  latest_page_number <- get_last_page(first_page)

  # Generate the target URLs
  list_of_pages <- str_c(url, '?page=', 1:latest_page_number)

  # Apply the extraction and bind the individual results back
  # into one table, which is then written as a tsv file into
  # the working directory
  list_of_pages %>%
    # Apply to all URLs
    map(get_data_from_url, company_name) %>%
    # Combine the tibbles into one tibble
    bind_rows() %>%
    # Write a tab-separated file
    write_tsv(str_c(company_name, '.tsv'))
}
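The "generate URLs, apply a function to each, bind the results" pattern can be sketched offline in base R. Everything here is illustrative: the URL, the page count and fake_get_data() are made up, and lapply() with do.call(rbind, ...) plays the role that map() and bind_rows() play above.

```r
# Hypothetical landing page and page count (no network access is needed)
url <- "https://example.com/review/some-company"
latest_page_number <- 3

# One URL per result page
list_of_pages <- paste0(url, "?page=", seq_len(latest_page_number))

# Stand-in for get_data_from_url(): returns a one-row data frame per page
fake_get_data <- function(page_url, company_name) {
  data.frame(company = company_name, source = page_url,
             stringsAsFactors = FALSE)
}

# Apply the function to every URL, then stack the results row-wise
pages <- lapply(list_of_pages, fake_get_data, company_name = "some-company")
all_pages <- do.call(rbind, pages)
all_pages
```

The extra argument company_name is passed through to every call, just as it is in map(get_data_from_url, company_name).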
We save the result to disk in a tab-separated file, instead of the common
comma-separated files (CSV), because the reviews may contain commas,
which may confuse the parser.
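A small round trip illustrates the point. This is a sketch with invented review text, using base R's write.table() and read.delim() rather than write_tsv(); the file goes to a temporary path.

```r
# Hypothetical reviews that contain commas
reviews <- data.frame(
  reviewer = c("A. Reader", "B. Writer"),
  review   = c("Fast, friendly, and cheap", "Slow delivery, but good support"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file without quoting the fields
path <- tempfile(fileext = ".tsv")
write.table(reviews, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back, splitting on tabs only
restored <- read.delim(path, stringsAsFactors = FALSE, quote = "")
identical(restored$review, reviews$review)
```

The commas survive because only tabs separate the fields. With quoting switched off as it is here, a tab character inside a review would cause the same kind of problem that commas cause in an unquoted CSV file.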
APPENDIX A
Reserved words in R Programming
Reserved words are words that have a special meaning in the language. They cannot
be used as identifiers. The following table lists the reserved words of R programming.
The list of reserved words can be viewed by typing help(reserved) or ?reserved at the R
command prompt.
for            while          if             else           repeat
function       in             next           break          TRUE
FALSE          NULL           Inf            NaN            NA
NA_integer_    NA_real_       NA_complex_    NA_character_  …
APPENDIX B
Standard identifiers of R programming
An identifier can be a combination of letters, digits, periods (.) and underscores (_).
It must start with a letter or a period. If it starts with a period, the period cannot be
followed by a digit.
Reserved words in R cannot be used as identifiers.
Examples of valid identifiers:
total
Sum
.fine.with.dot
aVariableName
as.numeric
levels
mydata
kgoto
sum
mean
x1
x2
y1
y2
uif
eend
aand
sdo
iwhile
vector1
array1
list1
matrix1
klabel
tchar
smod
number1
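One way to check a candidate name against these rules is to compare it with the output of base R's make.names(), which rewrites any invalid name into a valid one. The helper is_valid_name() and the candidate list below are illustrative only.

```r
# A name is syntactically valid if make.names() leaves it unchanged
is_valid_name <- function(x) make.names(x) == x

# A mix of valid and invalid candidates
candidates <- c("total", ".fine.with.dot", "x1",
                ".2way", "_tmp", "while", "TRUE")

valid <- is_valid_name(candidates)
names(valid) <- candidates
valid
```

The first three names pass. The others fail: .2way has a digit after the leading period, _tmp starts with an underscore, and while and TRUE are reserved words.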
APPENDIX C
Standard Procedures in R Programming
Procedure                        Purpose
help()                           opens help page
apropos()                        displays all objects matching topic
library(help=packageName)        help on a specific package
example()                        provides a sample demo
args()                           arguments for a function
?Control                         Help on control flow statements (e.g. if, for, while)
?Extract                         Help on operators acting to extract or replace subsets of vectors
?Logic                           Help on logical operators
?regex                           Help on regular expressions used in R
?Syntax                          Help on R syntax and giving the precedence of operators
append()                         add elements to a vector
rbind(), cbind()                 Combine vectors by row/column
grep()                           regular expressions
identical()                      test if 2 objects are exactly equal
length()                         no. of elements in vector
ls()                             list objects in current environment
range(x)                         minimum and maximum
rep(x,n)                         repeat the number x, n times
rev(x)                           elements of x in reverse order
seq(x,y,n)                       sequence (x to y, spaced by n)
sort(x)                          sort the vector x
order(x)                         list the sorted element numbers of x
tolower(), toupper()             Convert string to lower/upper case letters
unique(x)                        remove duplicate entries from vector
round(x), signif(x), trunc(x)    rounding functions
getwd()                          return working directory
setwd()                          set working directory
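A few of the vector procedures listed above, applied to a small example vector. The vector x and the variable names are chosen only for illustration.

```r
x <- c(3, 1, 2, 3)

repeated <- rep(7, 3)        # 7 7 7
reversed <- rev(x)           # 3 2 1 3
stepped  <- seq(1, 9, 2)     # 1 3 5 7 9
sorted   <- sort(x)          # 1 2 3 3
ordering <- order(x)         # 2 3 1 4
distinct <- unique(x)        # 3 1 2
extremes <- range(x)         # 1 3
extended <- append(x, 10)    # 3 1 2 3 10
shouted  <- toupper("abc")   # "ABC"

# order() returns positions, so indexing x by it gives the sorted vector
identical(x[ordering], sorted)
```

The difference between sort() and order() is worth noting: sort() returns the values themselves, while order() returns the positions that would sort them.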
APPENDIX D
Standard Functions of R Programming
Function                  Purpose                                                          Parameter Type
sqrt()                    Gives square root of a number                                    Integer, real
sum()                     Adds numbers                                                     Integer, real
log(x), log10(), exp()    Gives logarithmic (log, log10) or exponential (exp) value of x   Integer, real
cos(x)                    Gives cosine value of x (x in radians)                           Integer, real
sin(x)                    Gives sine value of x (x in radians)                             Integer, real
tan(x)                    Gives tangent value of x (x in radians)                          Integer, real
%%                        Modulus value of a number                                        Integer, real
%/%                       Integer division                                                 Integer
%*%                       Matrix multiplication                                            Integer, real
union()                   Gives union of two sets                                          Character
intersect()               Gives intersection of two sets                                   Character
setdiff()                 Gives difference of two sets                                     Character
eigen()                   Gives eigenvalues and eigenvectors                               Integer, real
deriv()                   Symbolic and algorithmic derivatives of simple expressions       Integer, real
integrate()               Adaptive quadrature over a finite or infinite interval           Integer, real
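A short session using several of the functions above. The input values and variable names are arbitrary and serve only as an illustration; note that the trigonometric functions expect radians, so an angle in degrees is multiplied by pi/180 first.

```r
remainder <- 17 %% 5                 # 2
quotient  <- 17 %/% 5                # 3

m <- matrix(1:4, nrow = 2)           # columns (1, 2) and (3, 4)
product <- m %*% m                   # rows (7, 15) and (10, 22)

both    <- union(c("a", "b"), c("b", "c"))       # "a" "b" "c"
common  <- intersect(c("a", "b"), c("b", "c"))   # "b"
only_in <- setdiff(c("a", "b"), c("b", "c"))     # "a"

root  <- sqrt(16)                    # 4
cos60 <- cos(60 * pi / 180)          # approximately 0.5

# Numerical integral of x^2 from 0 to 1, approximately 1/3
area <- integrate(function(x) x^2, lower = 0, upper = 1)$value
```

The set functions accept any vectors, not only character vectors; character input is used here to match the table.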
Bibliography
[1] Longhow Lam, A guide to Eclipse and the R plug-in StatET. www.splusbook.com,
2007
[2] Diethelm Wurtz, “S4 ‘timedate’ and ‘timeseries’ classes for R,” Journal of
Statistical Software.
[3] Longhow Lam, An Introduction to R Programming. Springer, 2010.
[4] D. M. Bates and D. G. Watts (1988), Nonlinear Regression Analysis and Its
Applications. John Wiley & Sons, New York.
[5] S. D. Silvey (1970), Statistical Inference. Penguin, London.
[6] W. N. Venables and B. D. Ripley, S Programming. Springer, 2000.
[7] D. Samperi, “Rcpp: R/C++ interface classes, using c++ libraries from R,” 2006.
Online Resources
[1] https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
Emmanuel Paradis, R for Beginners
[2] http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/resources/R/rFunctionsList.pdf
Charles DiMaggio, List of useful R Functions, Feb 2013.
[3] https://www.rstudio.com/wp-content/uploads/2016/05/base-r.pdf
Base, R-Studio Cheatsheet
[4] https://www.datamentor.io/r-programming/
Learning R Programming
[5] https://data-flair.training/blogs/r-tutorial/
R tutorial | Introduction to R Programming – Features and Applications