1. Introduction to R

advertisement
R Workshop: Day 2
Yun Ju Sung
7/14/2014
Instructor
• a PhD statistician with an interest in new statistical methods
for the genomic analysis of human complex diseases
• Education:
• BS in Mathematics at Pohang Univ. of Science and Technology in South
Korea
• MS and PhD in Statistics at the Univ. of Minnesota
• Postdoc training in Medical Genetics at the Univ. of Washington
• a Research Assistant Professor and work with DC Rao on
many grants related to the genetics of blood pressure,
cardiovascular disease, and related conditions
• Also a course master of the Fundamental of Genetic
Epidemiology (with Treva Rice)
Source for lecture material
• simpleR – using R for Introductory Statistics by John Verzani
R package: Simple
http://www.math.csi.cuny.edu/Statistics/R/simpleR
• Using R for Introductory Statistics by John Verzani
R package: UsingR
• Data Analysis and Graphics Using R: An Example-Based
Approach by John Maindonald and W. Hohn Braun
R package: DAAG
http://maths-people.anu.edu.au/~johnm/
Outline
1.
2.
3.
4.
5.
6.
Introduction to R
Data
Univariate Data
Bivariate Data
Regression Analysis
Multivariate Data
With any programming language, you cannot learn by watching
some else: you have to do it yourself. So get your hands dirty!
YouTube videos for new R user
• R Tutorial series by tutorial
(https://www.youtube.com/watch?v=ZoPJGmpYJzw&list=
PL69A9CCD816A5F3A5&index=1)
• Statistics with R series by Christoph Scherber
(https://www.youtube.com/watch?v=Xh6Rex3ARjc)
• Statistics with R series by Courtney Brown
(https://www.youtube.com/watch?v=2-kw1MlOS1U)
1. Introduction to R
include Friday’s material
Brief history of R
• R was originally written by Ross Ihaka and Robert Gentleman
at the University of Auckland
• It is an implementation of the S language, which was
principally developed by John Chambers
• In 1998, the Association for Computing Machinery gave John
Chambers its Software Award. His citation reads: “S has
forever altered the way people analyze, visualize, and
manipulate data ... It is an elegant, widely accepted, and
enduring software system, with conceptual integrity.”
• The R Project (www.r-project.org)
Reasons for using R
• R is free (copy it down from the internet). Use is covered
by the Free Software Foundation's GNU General Public
License, which is designed to guarantee the freedom of
users to develop and give away the software
• R runs on a wide variety of systems: Windows, MacOS X,
UNIX (including FreeBSD), and Linux
• R has state of the art statistical and graphical abilities, and
strong scientific computational abilities, with new features
regularly added
• R has a vibrant and rapidly growing user community,
who contribute by discussion on various email lists, by
adding new abilities, and by writing books and papers that
are intended to help other users
More reasons for using R
• R has become a system of choice for statistical
researchers. It is used increasingly for the development of
software in many different areas of science and commerce
• The R system has had, increasingly in the past five years, a
leading role in statistical software innovation. Each year,
the American Statistical Association Statistical Computing
and Graphics Section makes a $1000 cash award (the John
M Chambers award) for statistical software written by, or
in collaboration with, an undergraduate or graduate
student. All winning entries from 2003 to 2010 have been
for software that is associated with R.
• R makes well-designed publication-quality plots that can
incorporate mathematical symbols and formulae as
needed
Excellent features in R
• R has an excellent built-in help system.
• R has excellent graphing capabilities.
• The language has a powerful, easy to learn syntax with
many built-in statistical functions.
• The language is easy to extend with user-written
functions.
• R is a computer programming language. For programmers
it will feel more familiar than others. For new computer
users, the next leap to programming will not be so large.
• Students can easily migrate to the commercially supported
S-Plus program if commercial software is desired.
R as a calculator
> 1 + 1
# Simple Arithmetic
[1] 2
# The comment character (#) is used to make comments.
> 2 + 3 * 4 # Operator precedence
[1] 14
> 3 ^ 2
# Exponentiation
[1] 9
> exp(1)
# Basic math. functions are available
[1] 2.718282
> sqrt(10)
[1] 3.162278
> pi
# The constant pi is predefined
[1] 3.141593
> 2*pi*6378 # Circumference of earth at equator (in km)
[1] 40074.16
R as a smart calculator
> x = 1
> y = 3
# Can define variables
# using “=" to assign values.
# You can also use “<-”.
> z = 4
> x * y * z
[1] 12
> X * Y * Z
# names are case sensitive
Error: Object "X" not found
> This.Year = 2004
# names can include period
> This.Year
[1] 2004
R does a lot more!
• Definitely not just a calculator
• R can manipulate vectors, matrices and datasets
• R has many built-in statistical functions
• R produces excellent graphics
• R allows you to define your own functions
2. Data
include Friday’s material
What is data?
• When we read the newspaper or watch TV news, we
find data and its interpretation.
• Most often the data is presented in a summarized
format, letting the reader draw conclusions.
• Statistics allow us to summarize data in the familiar
terms of counts, proportions, and averages.
• So let us to learn about data: how to summarize it, how
to present it, and how to infer from it when appropriate.
Entering data with c
• The most useful R command for quickly entering in small data sets
is the c function, which combines or concatenates terms together.
• Example: suppose we have the following count of the number of
typos per page: 2 3 0 3 1 0 0 1
• In R
• We assigned the values to a variable called typos
• The value of the typos doesn't automatically print out. It does
when we type the name
• The value of typos is prefaced with a funny looking [1]. This
indicates that the value is a vector.
Data is a vector
• The data is stored in R as a vector. This means that it keeps
track of the order that the data is entered in.
• This is a good thing for several reasons
• Our simple data vector typos has a natural order: page 1,
page 2 etc. We wouldn't want to mix these up.
• We can make changes to the data item by item instead of
having to enter in the entire data set again.
• Vectors are also a mathematical object. There are natural
extensions of mathematical concepts such as addition and
multiplication that make it easy to work with data when
they are vectors.
Vectors in R
• Created with
• c() to concatenate elements
• rep() to repeat elements or patterns
• seq() or m:n to generate sequences
• Most mathematical functions and operators can be
applied to vectors without loops!
• Possible to select and edit groups of elements
simultaneously
Example with vectors in R
> rep(1,10)
[1] 1 1 1 1 1 1 1
> seq(2,6)
[1] 2 3 4 5 6
> seq(4,20,by=4)
[1] 4 8 12 16 20
#
1
#
#
#
repeats the number 1, 10 times
1 1
sequence of integers between 2 and 6
equivalent to 2:6
Every 4th integer between 4 and 20
> x = c(2,0,0,4) # Create vector with elements 2,0,0,4
> y = c(1,9,9,9)
> x + y
# Sums elements of two vectors
[1] 3 9 9 13
> x * 4
# Multiplies elements
[1] 8 0 0 16
> sqrt(x)
# Function applies to each element
[1] 1.41 0.00 0.00 2.00 # Returns vector
Accessing vector elements
• To extract data from a vector, use slicing and
extraction as below.
• Use the [] operator to select elements
• To select specific elements, use index or vector of
indexes to identify them
• To exclude specific elements, use negate index or
vector of indexes
• Alternatively, use vector of T and F values to select
subset of elements
Example
> x = c(2,0,0,4)
> x[1]
# Select the first element, equivalent to x[c(1)]
[1] 2
> x[-1]
# Exclude the first element
[1] 0 0 4
> x[1] = 3 ; x
[1] 3 0 0 4
> x[-1] = 5 ; x
[1] 3 5 5 5
> y < 9
# Compares each element, returns result as vector
[1] TRUE FALSE FALSE FALSE
> y[4] = 1
> y < 9
[1] TRUE FALSE FALSE TRUE
> y[y<9] = 2
# Edits elements marked as TRUE in index vector
> y
[1] 2 9 9 2
Assignment: Question 1
Try to guess the results of these R commands. Remember, the way to
access entries in a vector is with []. Suppose we assume
> x = c(1,3,5,7,9)
> y = c(2,3,5,7,11,13)
a. x+1
b. y*2
c. length(x) and length(y)
d. x + y
e. sum(x>5) and sum(x[x>5])
f.
sum(x>5 | x< 3)
g. y[3]
h. y[-3]
i.
y[x]
j.
y[y>=7]
Examples with typos
• Suppose we want to keep track of our various drafts as the typos
change.
• Or
• the assignment to the first entry in the vector typos.draft2 is
done by referencing the first entry in the vector. This is done
with square brackets [ ]
• parentheses () are for functions, and square brackets [ ] are for
vectors (and arrays and lists).
Apply a function
• R comes with many built-in functions that one can apply to
data such as typos. One of them is the mean function for
finding the mean or average of the data.
• Call the median or var to find the median or sample
variance.
• The syntax is the same: the function name followed by
parentheses to contain the argument(s):
Assignment: Question 2
Let the data x be given by
x = c(1, 8, 2, 6, 3, 8, 5, 5, 5, 5)
Use R to compute the following functions. Note, we use X1 to
denote the first element of x (which is 1) etc.
a. (X1 + X2 + … + X10)/10 (use sum)
b. Find log10(Xi) for each i. (Use the log function which by
default is base e)
c. Find (Xi -4.4)/2.875 for each i. (Do it all at once)
d. Find the difference between the largest and smallest
values of x. (This is the range. You can use max and min or
guess a built in command.)
Assignment: Question 3
Suppose you track your commute times for two weeks (10 days) and
you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Enter this into R.
a. Use the function max to find the longest commute time, the
function mean to find the average and the function min to find the
minimum.
b. Oops, the 24 was a mistake. It should have been 18. How can you
fix this? Do so, and then find the new average.
c. How many times was your commute 20 minutes or more? To
answer this one can try (if you called your numbers commutes)
sum (commutes >= 20) What do you get?
d. What percent of your commutes are less than 17 minutes? How
can you answer this with R?
Use graphs to check data
• Graphics are important for conveying important features
of the data.
• Numerical summaries, such as an average, can be very
useful, but important features of the data may be missed
without a glance at an appropriate graph.
• This is the best way to begin investigation of a new set of
data, drawing attention to obvious errors or quirks in the
data, or to obvious clues that the data contains.
• The use of graphs to display and help understand data has
a long tradition. John W. Tukey formalized and extended
this tradition, giving it the name Exploratory Data Analysis.
• Data should, as far as possible, have the opportunity to
speak for themselves, prior to or as part of a formal
analysis!
3. Univariate Data
Graphics and other simple functions to
explore univariate data, data with a single
variable.
Univariate data
• Data can be of three types: categorical, discrete numeric and
continuous numeric: methods for viewing and summarizing the
data depend on the type.
• The U.S. census (http://www.census.gov) asks questions of a
categorical nature.
• A doctor's chart which records data on a patient.
• The gender or the history of illnesses can be treated as
categories.
• The age of a person and their weight are numeric quantities.
The age is a discrete numeric quantity and the weight as well
(most people don't say they are 4.673 years old). These
numbers are usually reported as integers.
• If one really needed to know precisely, they could in theory
take on a continuum of values, and we would consider them
to be continuous.
Table for categorical data
• The table command allows us to look at tables. Its simplest
usage looks like table(x) where x is a categorical variable.
• Example: Smoking survey. A survey asks people if they smoke or
not. The data is
• We can enter this into R with the c() command, and summarize
with the table command as follows
• The table command simply adds up the frequency of each
unique value of the data
Assignment: Question 4
The number of O-ring failures for the first 23 flights of the US
space shuttle Challenger were
0 1 0 NA 0 0 0 0 0 1 1 1 0 0 3 0 0 0 0 0 2 0 1
(NA means not available - the equipment was lost).
Make a table of the possible categories. Try to find the mean.
(You might need to try mean(x, na.rm=TRUE) to avoid the value
NA, or look at x[!is.na(x)].)
Bar charts
• A bar chart draws a bar with a height proportional to the count
in the table.
• Suppose, a group of 25 people are surveyed as to their beerdrinking preference. The categories were (1) Domestic can, (2)
Domestic bottle, (3) Microbrew and (4) import. The raw data is
3411343313212123231111431
Bar charts in R
• In R
• To read in the data, use scan(), which is very useful for reading data
from a file or by typing. You type in the data. It stops adding data
when you enter a blank row. (Try ?scan for more information.)
• We don't use barplot with the raw data.
• Use the table command to create summarized data, then use barplot
to create the barplot of frequencies shown.
• For proportion, divide summarized data by the number of data points.
Pie charts
R codes:
Center and spread for numeric data
• R commands for common numerical summaries are mean, var,
sd, median and summary.
• Example: CEO salaries. A sample of CEO annual salaries (in
millions): 12 .4 5 2 50 8 3 1 4 0.25
Stem-and-leaf charts
• If the data set is relatively small, the stem-and-leaf diagram is
useful for seeing the shape of the distribution and the values.
• Use apropos() when you think you know the function's name
but aren't sure.
Histograms
• The simplest way to view a distribution of numeric data
• rug() gives the tick marks just above the x-axis
• jitter(x) gives a little jitter to the x values to eliminate ties
Boxplots
• The boxplot is useful to summarize data succinctly, displaying if the
data is symmetric or has suspected outliers.
• The boxplot has a box with lines at Q1, Median, Q3 and whiskers which
extend to the min and max.
• To showcase possible outliers, the whiskers are shorten to a length of
1:5 times the box length. Any points beyond that are plotted with
points.
• We can check quickly for symmetry and outliers (data points beyond
the whiskers).
Example: Movie sales
• data on movie revenues for the 25 biggest movies of a given week.
• Boxplots of the current and gross sales
• Both distributions are skewed, but the gross
sales are less so. This shows why Hollywood
is interested in the “big hit", as a real big hit
can generate a lot more revenue than quite
a few medium sized hits.
Assignment: Questions 5 and 6
5. Make a histogram and boxplot of three data sets: south, crime
and aid.
a. Which of these data sets is skewed?
b. Which has outliers?
c. Which is symmetric?
6. For the data sets bumpers, firstchi, math make a histogram.
Try to predict the mean, median and standard deviation. Check
your guesses with the appropriate R commands.
4. Bivariate Data
Graphics and other simple functions
to explore bivariate data, data with
two variables
Bivariate data
• With univariate data, we summarized a data set with measures
•
•
•
•
of center and spread and the shape of a distribution with
words such as “symmetric” and “long-tailed.”
With bivariate data we can ask additional questions about the
relationship between the two variables.
For example, are height and weight related? Are age and heart
rate related? Are income and taxes paid related? Is a new drug
better than an old drug? Does the weather depend on the
previous days weather?
If a bivariate data set has a natural pairing, such as (x1, y1), …,
(xn, yn), then it makes sense to investigate the data set jointly.
We will focus on relationships in numeric data.
Scatterplots to compare relationships
• The scatterplot is simple but important tool for investigating
pairwise relationships (for example, the height of a father
compared to their sons height).
• Home data example shows old assessed value (1970) versus
new assessed value (2000). There should be some relationship.
• Linear model will be covered later.
Correlation between two variables
• The correlation between two variables numerically describes
whether larger values of one variable are related to larger
values of the other variable.
• A valuable numeric summary of the strength of the linear
relationship is the Pearson correlation coefficient.
The Spearman rank correlation
• To get the Pearson correlation coefficient, use cor
• If the relationship between the variables is not linear but is
increasing, we can still use the correlation coefficient to
understand the strength of the relationship. We use the ranked
data.
• This is the Spearman rank correlation, which is the Pearson
correlation coefficient computed with the ranked data.
• Is there another way to get the Spearman correlation?
Pearson vs. Spearman correlation
• Example: the Pearson correlation for 4 cases
• In the 2nd plot, the Pearson correlation is 0.878, while the
Spearman correlation is 928.
• When a linear fit is inadequate, Spearman correlation better
captures the strength of relationship.
Assignment: Question 7
The data set mammals (in MASS package) contains data on body
weight versus brain weight.
a. Use the cor to find the Pearson and Spearman correlation
coefficients. Are they similar?
b. Plot the data using the plot command and see if you expect
them to be similar. You should be unsatisfied with this plot.
c. Next, plot the logarithm (log) of each variable and see if that
makes a difference.
Assignment: Question 8
The data set mtcars contains information about cars from a 1974
Motor Trend issue. Answer the following:
a. What are the variable names? (Try names.)
b. What is the maximum mpg? Which car has this?
c. What are the first 5 cars listed?
d. What horsepower (hp) does the “Valiant” have?
e. What are all the values for the Mercedes 450slc (Merc
450SLC)?
f. Make a scatterplot of cylinders (cyl) vs. miles per gallon
(mpg). Fit a regression line. Is this a good candidate for linear
regression?
R Basics: Reading in datasets with library and
data
The library and data command can be used in several different ways
• To list all available packages: Use the command library().
• To list all available datasets: Use the command data() without any
arguments
• To list all data sets in a given package: Use data(package='package
name') for example data(package=Simple).
• To read in a dataset: Use data('dataset name'). As in the example
data(movies). You need to load the package to access its datasets as
in the command library(“Simple”).
• To find out information about a dataset: You can use the help
command to see if there is documentation on the data set. For
example help(“movies") or equivalently ?movies
Assignment: Question 9
In the library MASS, a dataset UScereal contains information
about popular breakfast cereals.
Investigate the following relationships, and make comments on
what you see. You can use tables, barplots, scatterplots etc. to
do your investigation.
a. the relationship between manufacturer and shelf
b. the relationship between fat and vitamins
c. the relationship between fat and shelf
d. the relationship between carbohydrates and sugars
e. the relationship between fibre and manufacturer
f. the relationship between sodium and sugars
Are there other relationships you can predict and investigate?
5. Regression Analysis
Regression analysis is fundamental
and forms a major part of statistical
analysis!
Linear regression model
• Linear regression can be used to study the linear relationship
for paired data sets (x, y).
• When x and y have a linear relationship in a mathematical
sense, y = mx + b, where m is the slope of the line and b the
intercept.
• In statistics, we don’t assume these variables have an exact
linear relationship: rather, we consider the possibility for noise
or error.
• The regression model is yi=β0+β1xi+εi
• The value εi is an error term
• The coefficients β0 and β1are the regression coefficients
Linear regression analysis
• The regression model: y = β0 + β1x + ε
• The values of β0 and β1 are unknown and will be estimated in a
reasonable manner from the data
• The estimated regression line is yˆ  ˆ 0  ˆ 1 x
(using "hats" to denote the estimates)
• For each data point xi we have
yˆ i  ˆ 0  ˆ1 x i
(called the predicted value)
• The difference between the true and
predicted value is the residual
e i  y i  yˆ i
Statistical model: signal vs. noise
• Statistical models have both deterministic and random error
components, or signal components and noise components.
observation = signal + noise
(β0 + β1x is signal and ε is noise in linear model)
• After fitting a model, we have
observation = fitted value + residual
( ˆ 0  ˆ1 x is fitted value and e is residual)
which we can think of as
observation = smooth component + rough component.
• The idea is that fitted value will recapture most of the signal
and the residual will contain mostly noise.
American Society of Human Genetics
Example: linear regression with R
• The maximum heart rate of a person is often said to be related
to age by the equation: Max.rate= 220 - Age
• Suppose this is to be empirically proven and 15 people of
varying ages are tested for their maximum heart rate.
• We use lm() to fit a linear model
R’s model formula notation
• To fit a linear model, we use lm(y~x)
• The most basic usage for lm is
•
•
•
•
•
lm(formula)
The formula is a model formula that represents the simple
linear regression model.
The response variable is on the left hand side and the predictor
on the right: response ~ predictor
In our example, this is y ~ x,
where ~ in this notation is read is modeled by.
So, the model formula y ~ x would be read: y is modeled by x.
The model formula implicitly assumes an intercept term and a
linear model.
Linear regression with R
• The result of the lm function can be stored:
• We can use summary to get regression coefficients (and more).
• The result of the lm function is of class lm and so the plot and
summary commands have been adapted.
• For several generic functions (including print, plot, and
summary), the result depends on the class of object that is
given as argument.
Download