R Workshop: Day 2
Yun Ju Sung
7/14/2014

Instructor
• A PhD statistician with an interest in new statistical methods for the genomic analysis of human complex diseases
• Education:
  • BS in Mathematics at Pohang Univ. of Science and Technology in South Korea
  • MS and PhD in Statistics at the Univ. of Minnesota
  • Postdoc training in Medical Genetics at the Univ. of Washington
• A Research Assistant Professor who works with DC Rao on many grants related to the genetics of blood pressure, cardiovascular disease, and related conditions
• Also a course master of the Fundamentals of Genetic Epidemiology (with Treva Rice)

Source for lecture material
• simpleR – Using R for Introductory Statistics by John Verzani
  R package: Simple
  http://www.math.csi.cuny.edu/Statistics/R/simpleR
• Using R for Introductory Statistics by John Verzani
  R package: UsingR
• Data Analysis and Graphics Using R: An Example-Based Approach by John Maindonald and W. John Braun
  R package: DAAG
  http://maths-people.anu.edu.au/~johnm/

Outline
1. Introduction to R
2. Data
3. Univariate Data
4. Bivariate Data
5. Regression Analysis
6. Multivariate Data

With any programming language, you cannot learn by watching someone else: you have to do it yourself. So get your hands dirty!

YouTube videos for new R users
• R Tutorial series by tutorial (https://www.youtube.com/watch?v=ZoPJGmpYJzw&list=PL69A9CCD816A5F3A5&index=1)
• Statistics with R series by Christoph Scherber (https://www.youtube.com/watch?v=Xh6Rex3ARjc)
• Statistics with R series by Courtney Brown (https://www.youtube.com/watch?v=2-kw1MlOS1U)

1. Introduction to R
(includes Friday's material)

Brief history of R
• R was originally written by Ross Ihaka and Robert Gentleman at the University of Auckland
• It is an implementation of the S language, which was principally developed by John Chambers
• In 1998, the Association for Computing Machinery gave John Chambers its Software Award.
His citation reads: “S has forever altered the way people analyze, visualize, and manipulate data ... It is an elegant, widely accepted, and enduring software system, with conceptual integrity.”
• The R Project (www.r-project.org)

Reasons for using R
• R is free (copy it down from the internet). Use is covered by the Free Software Foundation's GNU General Public License, which is designed to guarantee the freedom of users to develop and give away the software
• R runs on a wide variety of systems: Windows, MacOS X, UNIX (including FreeBSD), and Linux
• R has state of the art statistical and graphical abilities, and strong scientific computational abilities, with new features regularly added
• R has a vibrant and rapidly growing user community, who contribute by discussion on various email lists, by adding new abilities, and by writing books and papers that are intended to help other users

More reasons for using R
• R has become a system of choice for statistical researchers. It is used increasingly for the development of software in many different areas of science and commerce
• The R system has had, increasingly in the past five years, a leading role in statistical software innovation. Each year, the American Statistical Association Statistical Computing and Graphics Section makes a $1000 cash award (the John M Chambers award) for statistical software written by, or in collaboration with, an undergraduate or graduate student. All winning entries from 2003 to 2010 have been for software that is associated with R.
• R makes well-designed publication-quality plots that can incorporate mathematical symbols and formulae as needed

Excellent features in R
• R has an excellent built-in help system.
• R has excellent graphing capabilities.
• The language has a powerful, easy to learn syntax with many built-in statistical functions.
• The language is easy to extend with user-written functions.
• R is a computer programming language.
For programmers it will feel more familiar than others. For new computer users, the next leap to programming will not be so large.
• Students can easily migrate to the commercially supported S-Plus program if commercial software is desired.

R as a calculator
> 1 + 1          # Simple arithmetic
[1] 2
                 # The comment character (#) is used to make comments.
> 2 + 3 * 4      # Operator precedence
[1] 14
> 3 ^ 2          # Exponentiation
[1] 9
> exp(1)         # Basic math functions are available
[1] 2.718282
> sqrt(10)
[1] 3.162278
> pi             # The constant pi is predefined
[1] 3.141593
> 2*pi*6378      # Circumference of earth at equator (in km)
[1] 40074.16

R as a smart calculator
> x = 1          # Can define variables
> y = 3          # using "=" to assign values.
> z = 4          # You can also use "<-".
> x * y * z
[1] 12
> X * Y * Z      # names are case sensitive
Error: object 'X' not found
> This.Year = 2004   # names can include a period
> This.Year
[1] 2004

R does a lot more!
• Definitely not just a calculator
• R can manipulate vectors, matrices and datasets
• R has many built-in statistical functions
• R produces excellent graphics
• R allows you to define your own functions

2. Data
(includes Friday's material)

What is data?
• When we read the newspaper or watch TV news, we find data and its interpretation.
• Most often the data is presented in a summarized format, letting the reader draw conclusions.
• Statistics allow us to summarize data in the familiar terms of counts, proportions, and averages.
• So let us learn about data: how to summarize it, how to present it, and how to infer from it when appropriate.

Entering data with c
• The most useful R command for quickly entering small data sets is the c function, which combines or concatenates terms together.
• Example: suppose we have the following count of the number of typos per page: 2 3 0 3 1 0 0 1
• In R
> typos = c(2,3,0,3,1,0,0,1)
> typos
[1] 2 3 0 3 1 0 0 1
• We assigned the values to a variable called typos
• The value of typos doesn't automatically print out.
It does when we type the name
• The value of typos is prefaced with a funny looking [1]. This indicates that the value is a vector.

Data is a vector
• The data is stored in R as a vector. This means that it keeps track of the order in which the data is entered.
• This is a good thing for several reasons
  • Our simple data vector typos has a natural order: page 1, page 2, etc. We wouldn't want to mix these up.
  • We can make changes to the data item by item instead of having to enter the entire data set again.
  • Vectors are also mathematical objects. There are natural extensions of mathematical concepts such as addition and multiplication that make it easy to work with data when they are vectors.

Vectors in R
• Created with
  • c() to concatenate elements
  • rep() to repeat elements or patterns
  • seq() or m:n to generate sequences
• Most mathematical functions and operators can be applied to vectors without loops!
• Possible to select and edit groups of elements simultaneously

Example with vectors in R
> rep(1,10)          # repeats the number 1, 10 times
[1] 1 1 1 1 1 1 1 1 1 1
> seq(2,6)           # sequence of integers between 2 and 6
[1] 2 3 4 5 6        # equivalent to 2:6
> seq(4,20,by=4)     # Every 4th integer between 4 and 20
[1]  4  8 12 16 20
> x = c(2,0,0,4)     # Create vector with elements 2,0,0,4
> y = c(1,9,9,9)
> x + y              # Sums elements of two vectors
[1]  3  9  9 13
> x * 4              # Multiplies each element by 4
[1]  8  0  0 16
> sqrt(x)            # Function applies to each element
[1] 1.414214 0.000000 0.000000 2.000000   # Returns vector

Accessing vector elements
• To extract data from a vector, use slicing and extraction as below.
• Use the [] operator to select elements
• To select specific elements, use an index or a vector of indexes to identify them
• To exclude specific elements, use a negative index or a vector of negative indexes
• Alternatively, use a vector of T and F values to select a subset of elements

Example
> x = c(2,0,0,4)
> x[1]               # Select the first element, equivalent to x[c(1)]
[1] 2
> x[-1]              # Exclude the first element
[1] 0 0 4
> x[1] = 3 ; x
[1] 3 0 0 4
> x[-1] = 5 ; x
[1] 3 5 5 5
> y < 9              # Compares each element, returns result as vector
[1]  TRUE FALSE FALSE FALSE
> y[4] = 1
> y < 9
[1]  TRUE FALSE FALSE  TRUE
> y[y<9] = 2         # Edits elements marked as TRUE in index vector
> y
[1] 2 9 9 2

Assignment: Question 1
Try to guess the results of these R commands. Remember, the way to access entries in a vector is with []. Suppose we assume
> x = c(1,3,5,7,9)
> y = c(2,3,5,7,11,13)
a. x+1
b. y*2
c. length(x) and length(y)
d. x + y
e. sum(x>5) and sum(x[x>5])
f. sum(x>5 | x< 3)
g. y[3]
h. y[-3]
i. y[x]
j. y[y>=7]

Examples with typos
• Suppose we want to keep track of our various drafts as the typos change.
• Or
• The assignment to the first entry in the vector typos.draft2 is done by referencing the first entry in the vector. This is done with square brackets [ ]
• Parentheses () are for functions, and square brackets [ ] are for vectors (and arrays and lists).

Apply a function
• R comes with many built-in functions that one can apply to data such as typos. One of them is the mean function for finding the mean or average of the data.
• Call median or var to find the median or sample variance.
• The syntax is the same: the function name followed by parentheses to contain the argument(s):

Assignment: Question 2
Let the data x be given by
x = c(1, 8, 2, 6, 3, 8, 5, 5, 5, 5)
Use R to compute the following functions. Note, we use X1 to denote the first element of x (which is 1) etc.
a. (X1 + X2 + … + X10)/10 (use sum)
b. Find log10(Xi) for each i. (Use the log function, which by default is base e)
c.
Find (Xi - 4.4)/2.875 for each i. (Do it all at once)
d. Find the difference between the largest and smallest values of x. (This is the range. You can use max and min or guess a built-in command.)

Assignment: Question 3
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Enter this into R.
a. Use the function max to find the longest commute time, the function mean to find the average and the function min to find the minimum.
b. Oops, the 24 was a mistake. It should have been 18. How can you fix this? Do so, and then find the new average.
c. How many times was your commute 20 minutes or more? To answer this one can try (if you called your numbers commutes) sum(commutes >= 20). What do you get?
d. What percent of your commutes are less than 17 minutes? How can you answer this with R?

Use graphs to check data
• Graphics are important for conveying important features of the data.
• Numerical summaries, such as an average, can be very useful, but important features of the data may be missed without a glance at an appropriate graph.
• Looking at a graph is the best way to begin investigating a new set of data: it draws attention to obvious errors or quirks in the data, or to obvious clues that the data contains.
• The use of graphs to display and help understand data has a long tradition. John W. Tukey formalized and extended this tradition, giving it the name Exploratory Data Analysis.
• Data should, as far as possible, have the opportunity to speak for themselves, prior to or as part of a formal analysis!

3. Univariate Data
Graphics and other simple functions to explore univariate data (data with a single variable).

Univariate data
• Data can be of three types: categorical, discrete numeric and continuous numeric. Methods for viewing and summarizing the data depend on the type.
• The U.S. census (http://www.census.gov) asks questions of a categorical nature.
• A doctor's chart records several types of data on a patient.
  • The gender or the history of illnesses can be treated as categories.
  • The age of a person and their weight are numeric quantities. Age is usually treated as a discrete numeric quantity, and so is weight (most people don't say they are 4.673 years old). These numbers are usually reported as integers.
  • If one really needed to know precisely, they could in theory take on a continuum of values, and we would consider them to be continuous.

Table for categorical data
• The table command allows us to look at tables. Its simplest usage looks like table(x), where x is a categorical variable.
• Example: Smoking survey. A survey asks people if they smoke or not. The data is
• We can enter this into R with the c() command, and summarize with the table command as follows
• The table command simply adds up the frequency of each unique value of the data

Assignment: Question 4
The number of O-ring failures for the first 23 flights of the US space shuttle Challenger were
0 1 0 NA 0 0 0 0 0 1 1 1 0 0 3 0 0 0 0 0 2 0 1
(NA means not available: the equipment was lost). Make a table of the possible categories. Try to find the mean. (You might need to try mean(x, na.rm=TRUE) to avoid the value NA, or look at x[!is.na(x)].)

Bar charts
• A bar chart draws a bar with a height proportional to the count in the table.
• Suppose a group of 25 people are surveyed as to their beer-drinking preference. The categories were (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import. The raw data is
3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1

Bar charts in R
• In R
• To read in the data, use scan(), which is very useful for reading data from a file or by typing. You type in the data. It stops adding data when you enter a blank row. (Try ?scan for more information.)
• We don't use barplot with the raw data.
• Use the table command to create summarized data, then use barplot to create the barplot of frequencies shown.
• For proportions, divide the summarized data by the number of data points.

Pie charts
R codes:

Center and spread for numeric data
• R commands for common numerical summaries are mean, var, sd, median and summary.
• Example: CEO salaries. A sample of CEO annual salaries (in millions):
12 0.4 5 2 50 8 3 1 4 0.25

Stem-and-leaf charts
• If the data set is relatively small, the stem-and-leaf diagram is useful for seeing the shape of the distribution and the values.
• Use apropos() when you think you know the function's name but aren't sure.

Histograms
• The simplest way to view a distribution of numeric data
• rug() gives the tick marks just above the x-axis
• jitter(x) gives a little jitter to the x values to eliminate ties

Boxplots
• The boxplot is useful to summarize data succinctly, displaying whether the data is symmetric or has suspected outliers.
• The boxplot has a box with lines at Q1, Median, Q3 and whiskers which extend to the min and max.
• To showcase possible outliers, the whiskers are shortened to a length of 1.5 times the box length. Any values beyond that are plotted as individual points.
• We can check quickly for symmetry and outliers (data points beyond the whiskers).

Example: Movie sales
• Data on movie revenues for the 25 biggest movies of a given week.
• Boxplots of the current and gross sales
• Both distributions are skewed, but the gross sales are less so. This shows why Hollywood is interested in the "big hit", as a real big hit can generate a lot more revenue than quite a few medium-sized hits.

Assignment: Questions 5 and 6
5. Make a histogram and boxplot of three data sets: south, crime and aid.
a. Which of these data sets is skewed?
b. Which has outliers?
c. Which is symmetric?
6. For the data sets bumpers, firstchi, math make a histogram. Try to predict the mean, median and standard deviation. Check your guesses with the appropriate R commands.

4.
Bivariate Data
Graphics and other simple functions to explore bivariate data (data with two variables).

Bivariate data
• With univariate data, we summarized a data set with measures of center and spread and the shape of a distribution with words such as “symmetric” and “long-tailed.”
• With bivariate data we can ask additional questions about the relationship between the two variables.
• For example, are height and weight related? Are age and heart rate related? Are income and taxes paid related? Is a new drug better than an old drug? Does the weather depend on the previous day's weather?
• If a bivariate data set has a natural pairing, such as (x1, y1), …, (xn, yn), then it makes sense to investigate the data set jointly.
• We will focus on relationships in numeric data.

Scatterplots to compare relationships
• The scatterplot is a simple but important tool for investigating pairwise relationships (for example, the height of a father compared to his son's height).
• The home data example shows old assessed value (1970) versus new assessed value (2000). There should be some relationship.
• The linear model will be covered later.

Correlation between two variables
• The correlation between two variables numerically describes whether larger values of one variable are related to larger values of the other variable.
• A valuable numeric summary of the strength of the linear relationship is the Pearson correlation coefficient.

The Spearman rank correlation
• To get the Pearson correlation coefficient, use cor
• If the relationship between the variables is not linear but is increasing, we can still use the correlation coefficient to understand the strength of the relationship. We use the ranked data.
• This is the Spearman rank correlation, which is the Pearson correlation coefficient computed with the ranked data.
• Is there another way to get the Spearman correlation?

Pearson vs.
Spearman correlation
• Example: the Pearson correlation for 4 cases
• In the 2nd plot, the Pearson correlation is 0.878, while the Spearman correlation is 0.928.
• When a linear fit is inadequate, the Spearman correlation better captures the strength of the relationship.

Assignment: Question 7
The data set mammals (in the MASS package) contains data on body weight versus brain weight.
a. Use cor to find the Pearson and Spearman correlation coefficients. Are they similar?
b. Plot the data using the plot command and see if you expect them to be similar. You should be unsatisfied with this plot.
c. Next, plot the logarithm (log) of each variable and see if that makes a difference.

Assignment: Question 8
The data set mtcars contains information about cars from a 1974 Motor Trend issue. Answer the following:
a. What are the variable names? (Try names.)
b. What is the maximum mpg? Which car has this?
c. What are the first 5 cars listed?
d. What horsepower (hp) does the "Valiant" have?
e. What are all the values for the Mercedes 450slc (Merc 450SLC)?
f. Make a scatterplot of cylinders (cyl) vs. miles per gallon (mpg). Fit a regression line. Is this a good candidate for linear regression?

R Basics: Reading in datasets with library and data
The library and data commands can be used in several different ways
• To list all available packages: Use the command library().
• To list all available datasets: Use the command data() without any arguments.
• To list all data sets in a given package: Use data(package='package name'), for example data(package='Simple').
• To read in a dataset: Use data('dataset name'), as in the example data(movies). You need to load the package first to access its datasets, as in the command library("Simple").
• To find out information about a dataset: You can use the help command to see if there is documentation on the data set.
For example help("movies") or equivalently ?movies

Assignment: Question 9
In the library MASS, a dataset UScereal contains information about popular breakfast cereals. Investigate the following relationships, and make comments on what you see. You can use tables, barplots, scatterplots etc. to do your investigation.
a. the relationship between manufacturer and shelf
b. the relationship between fat and vitamins
c. the relationship between fat and shelf
d. the relationship between carbohydrates and sugars
e. the relationship between fibre and manufacturer
f. the relationship between sodium and sugars
Are there other relationships you can predict and investigate?

5. Regression Analysis
Regression analysis is fundamental and forms a major part of statistical analysis!

Linear regression model
• Linear regression can be used to study the linear relationship for paired data sets (x, y).
• When x and y have a linear relationship in a mathematical sense, we write $y = mx + b$, where $m$ is the slope of the line and $b$ the intercept.
• In statistics, we don't assume these variables have an exact linear relationship: rather, we allow for noise or error.
• The regression model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
• The value $\varepsilon_i$ is an error term
• The coefficients $\beta_0$ and $\beta_1$ are the regression coefficients

Linear regression analysis
• The regression model: $y = \beta_0 + \beta_1 x + \varepsilon$
• The values of $\beta_0$ and $\beta_1$ are unknown and will be estimated in a reasonable manner from the data
• The estimated regression line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ (using "hats" to denote the estimates)
• For each data point $x_i$ we have $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ (called the predicted value)
• The difference between the true and predicted value is the residual $e_i = y_i - \hat{y}_i$

Statistical model: signal vs. noise
• Statistical models have both deterministic and random error components, or signal components and noise components.
observation = signal + noise
($\beta_0 + \beta_1 x$ is signal and $\varepsilon$ is noise in the linear model)
• After fitting a model, we have
observation = fitted value + residual
($\hat{\beta}_0 + \hat{\beta}_1 x$ is the fitted value and $e$ is the residual)
which we can think of as
observation = smooth component + rough component.
• The idea is that the fitted value will recapture most of the signal and the residual will contain mostly noise.

Example: linear regression with R
• The maximum heart rate of a person is often said to be related to age by the equation: Max.rate = 220 - Age
• Suppose this is to be checked empirically, and 15 people of varying ages are tested for their maximum heart rate.
• We use lm() to fit a linear model

R's model formula notation
• To fit a linear model, we use lm(y ~ x)
• The most basic usage for lm is lm(formula)
• The formula is a model formula that represents the simple linear regression model.
• The response variable is on the left hand side and the predictor on the right: response ~ predictor
• In our example, this is y ~ x, where ~ in this notation is read "is modeled by".
• So, the model formula y ~ x would be read: y is modeled by x.
• The model formula implicitly assumes an intercept term and a linear model.

Linear regression with R
• The result of the lm function can be stored:
• We can use summary to get regression coefficients (and more).
• The result of the lm function is of class lm and so the plot and summary commands have been adapted.
• For several generic functions (including print, plot, and summary), the result depends on the class of object that is given as argument.
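
The points above about storing the result of lm and inspecting it can be sketched as follows. This is only an illustration: the 15 ages and heart rates are simulated here (they are not the workshop's measurements), and the variable names age, max.rate and fit are made up for the example.

```r
# Simulated data for illustration only (NOT the workshop's 15 measurements).
# Heart rates are generated from Max.rate = 220 - Age plus random noise.
set.seed(1)
age      <- seq(20, 76, by = 4)                # 15 hypothetical ages
max.rate <- 220 - age + rnorm(length(age), sd = 5)

# Fit the simple linear regression "max.rate is modeled by age"
# and store the result.
fit <- lm(max.rate ~ age)

class(fit)      # the stored result is an object of class "lm"
coef(fit)       # estimated intercept and slope
summary(fit)    # coefficients, standard errors, R-squared, and more

# observation = fitted value + residual
decomposed <- unname(fitted(fit) + resid(fit))
all.equal(decomposed, max.rate)
```

Because fit has class lm, generic functions such as print(fit), summary(fit) and plot(fit) choose the lm-specific behaviour automatically. With real data, the estimated slope and intercept can be compared with the values -1 and 220 suggested by the rule of thumb.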