R

Acknowledgement: http://www.stat.columbia.edu/~martin/W2024/W2024.html and many other websites.

Introduction to R
R is a free programming language and software environment for statistical computing and graphics. Its popularity has increased substantially in recent years.

Installing R for Windows
Download R for Windows from http://cran.r-project.org/. Double-click “R-3.1.1-win.exe” and follow the setup wizard until the installation finishes successfully.

Open R
Double-click the R icon on your Windows desktop, or open it from the Start menu: Start > R > R i386 3.1.0.

Section 1 GETTING STARTED WITH R

R
R is a programming language and software environment for statistical computing and graphics. It is highly extensible and has become one of the most popular languages among statisticians for developing and introducing new statistical methods.

Basic commands
The basic commands let you see which datasets are available in your workspace, get more information, and quit R.

Math operations

Vectors and Matrices
A vector is an array of numbers or characters.

Data Frames
R often works with data frames, which are like matrices except that the columns are allowed to be of different types (e.g., one column can be numerical, while another consists of characters). Data frames are created with data.frame().

Reading and Writing Data
read.table() can read an external file (for example, bank.txt) and create a data frame.

Built-in Datasets
Trees dataset: provides the measurements of the girth, height, and volume of timber in 31 felled black cherry trees.
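The command screenshots from the original slides are not reproduced in this text, so the following is only a minimal sketch of the basics covered so far, using base R. The data frame `people` and its values are hypothetical and exist purely for illustration; `trees` is the built-in dataset described above.

```r
# Basic commands
ls()                           # list the objects in the current workspace
# data()                       # would list the built-in datasets
# help(mean)                   # would open the help page for mean(); q() quits R

# Math operations
3 + 4 * 2                      # 11
sqrt(16)                       # 4

# Vectors and matrices
v <- c(2, 4, 6, 8)             # numeric vector
m <- matrix(1:6, nrow = 2)     # 2 x 3 matrix, filled column by column
v[2]                           # second element of v
m[2, 3]                        # element in row 2, column 3

# A data frame whose columns have different types (hypothetical values)
people <- data.frame(name = c("Ann", "Bob", "Cy"),
                     age  = c(31, 45, 27))
str(people)

# Write the data frame to a text file and read it back with read.table()
tmp <- tempfile(fileext = ".txt")
write.table(people, tmp, row.names = FALSE)
people2 <- read.table(tmp, header = TRUE)

# Built-in dataset
head(trees)                    # columns Girth, Height, Volume
dim(trees)                     # 31 rows, 3 columns
```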

Section 2 MEAN, MEDIAN, MODE

Calculate mean in R
Create a vector “a”, then calculate the mean of “a” and save it into “b”.

Calculate median in R
Create a vector “a”, then calculate the median of “a” and save it into “b”.

Calculate mode in R
Create a vector “a”, then count how many occurrences there are of each value and save the counts into “b”. For assignment, the operator “<-” has the same function as “=”. The first row of “b” is a sorted list of all unique values in the vector “a”. The second row of “b” counts how many occurrences there are of each value.
To calculate the mode of vector “a”, use a command that returns the names of the values that have the highest count in the second row of “b”. Since the mode is the value (or values) that occurs most frequently in a vector or matrix, this returns the mode.

Section 3 DISPERSION

Calculate range in R
Create a vector “a”, then calculate its range.

Calculate SD in R
Create a vector “a”, then calculate its standard deviation. sd() returns the sample standard deviation, like STDEV() in Excel. To obtain the same result as STDEVP() in Excel (the population standard deviation), multiply the sd() result by sqrt((N-1)/N).

Calculate variance in R
Create a vector “a”, then calculate its variance. The variance is the squared SD.

Section 4 Z SCORE

Calculate z score in R
Zscore=(x-mean(x))/sd(x)
Compute the z scores of the values 55, 50, 60, 57.5, and 46, where the mean is 50 and the standard deviation is 5.

Z-test in R (1)
Suppose that 10 volunteers have taken an intelligence test and obtained these scores: 65, 78, 88, 55, 48, 95, 66, 57, 79, 81. The mean obtained on the same test by the entire population is 75. Is there a statistically significant difference (at the 5% significance level, i.e., with 95% confidence) between the means of the sample and the population, assuming that the population variance is known and equal to 18?

Z-test in R (2)
Add a function to R, then apply the function.
P value for Z score calculator: http://www.socscistatistics.com/pvalues/normaldistribution.aspx

Section 5 T-TEST

T-test in R
T-tests are used to determine whether the means of two groups are equal to each other.
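Looking back at Section 4 before continuing with the t-test: the z-score and z-test slides can be sketched as follows. Base R does not ship a z-test function, so the helper `z.test` below is a hypothetical name defined only for this illustration, and the two-sided p-value from `pnorm()` stands in for the online calculator linked above.

```r
# z scores for given values, with a stated mean of 50 and SD of 5
x  <- c(55, 50, 60, 57.5, 46)
zs <- (x - 50) / 5
zs                                   # 1.0 0.0 2.0 1.5 -0.8

# Hypothetical helper: one-sample z statistic with known population variance
z.test <- function(x, mu, var) {
  (mean(x) - mu) / sqrt(var / length(x))
}

# The ten test scores, population mean 75, known population variance 18
scores <- c(65, 78, 88, 55, 48, 95, 66, 57, 79, 81)
z <- z.test(scores, mu = 75, var = 18)
z                                    # about -2.83

# Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))
p
```

Under these assumptions |z| exceeds 1.96, so the sample mean differs significantly from the population mean at the 5% level.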
The classical t-test assumes that both groups are sampled from normal distributions with equal variances. (By default, t.test() in R runs Welch's version, which drops the equal-variance assumption; pass var.equal=TRUE for the classical test.) The null hypothesis is that the two means are equal, and the alternative is that they are not.
Ref: http://statistics.berkeley.edu/computing/r-t-tests http://www.stat.columbia.edu/~martin/W2024/R2.pdf

The t.test() function in R can perform both one-sample and two-sample t-tests on vectors of data.

One-sample t-tests
P<0.05, so we reject the null hypothesis: the mean Salmonella level in the ice cream is greater than 0.3 MPN/g.

Two-sample t-tests
P<0.05, so the null hypothesis is rejected.

Paired t-test
P<0.05, so the null hypothesis is rejected.

Section 6 ANOVA

ANOVA in R
Sometimes we need to determine whether the means from more than two populations or groups are equal. To test whether the difference in means is statistically significant, we use analysis of variance (ANOVA). The function in R is aov().

First, we can graphically compare the means of the variable of interest across groups. We can create boxplots of measurements organized in groups using plot().

Example: A drug company tested three formulations of a pain relief medicine for migraine headache sufferers. For the experiment, 27 volunteers were selected and 9 were randomly assigned to each of the three drug formulations. The subjects were instructed to take the drug during their next migraine headache episode and to report their pain on a scale of 1 to 10 (10 being the most pain).

Boxplots

Fit the model with aov(response~factor, data=data_name) and display the ANOVA table with summary().
The results are F-value=11.91 and p-value=0.0003, which is highly significant.
So we reject the null hypothesis: the means of the three drug groups are not all equal.

Multiple comparisons
The ANOVA F-test answers the question of whether there are significant differences among the K population means. However, it does not tell us how they differ. The function pairwise.t.test() tests the pairwise differences between group means.

The difference between B and C is not significant, but the differences between A and B and between A and C are, with p-values below 0.05 (or even 0.01). So drug A is very different from the other two.

Another multiple comparisons procedure is Tukey’s method (i.e., Tukey’s Honest Significant Difference test). The function TukeyHSD() creates a set of confidence intervals on the differences between means with the specified family-wise probability of coverage.

Two-way ANOVA
Two-way ANOVA is a technique for studying the relationship between a quantitative dependent variable and two qualitative independent variables.

Example: A student was interested in her success at basketball free throws. The study investigates whether there is any relationship between the quantitative variable “number of shots made” and the two qualitative variables “Time of Day” and “Shoe Worn”.

R can read data from a text file. The text file has to be in the form of a table, with columns representing variables and the first row of the file holding the names of the variables. All columns must be the same length. Missing data are coded as “NA”. R is case-sensitive.

After attach() is used, the variables in the object basketball can be referred to directly by name. Now we are ready to run the ANOVA.

First, we can compare the two times (morning vs. night) or the two shoes (favorite vs. others) by looking at summary statistics or boxplots. To get the means for each level of each factor, use tapply().
It seems that she does better at night and in her favorite shoes. But that could just be due to natural variability, so we use ANOVA.
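The workflow just described (group means with tapply(), boxplots, then the ANOVA itself) can be sketched as below. The student's data file is not included in this text, so the data frame is built by hand: the column names `Made`, `Time` and `Shoes` are assumptions, and every value is invented for illustration only. Because of that, the output will not match the p-values reported in these slides. The sketch also passes `data = basketball` instead of calling attach(), which is a stylistic choice rather than the slides' approach.

```r
# Hypothetical data: 4 sessions for each Time x Shoes combination (values invented)
basketball <- data.frame(
  Time  = rep(c("Morning", "Night"), each = 8),
  Shoes = rep(rep(c("Favorite", "Others"), each = 4), times = 2),
  Made  = c(25, 26, 20, 22,  21, 18, 23, 19,
            27, 24, 28, 26,  22, 25, 20, 23)
)
basketball$Time  <- factor(basketball$Time)
basketball$Shoes <- factor(basketball$Shoes)

# Mean number of shots made for each level of each factor
time_means <- tapply(basketball$Made, basketball$Time, mean)
shoe_means <- tapply(basketball$Made, basketball$Shoes, mean)
time_means
shoe_means

# Boxplots of shots made, grouped by each factor
boxplot(Made ~ Time, data = basketball)
boxplot(Made ~ Shoes, data = basketball)

# Two-way ANOVA with interaction: Time, Shoes, and Time:Shoes
fit <- aov(Made ~ Time * Shoes, data = basketball)
summary(fit)
```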

Two-way ANOVA
The p-value for the interaction is 0.38, so we fail to reject the null hypothesis. That means there is no evidence that the interaction of time and shoes changes the free throw performance.
The p-values for both individual variables are also >0.05, so we again fail to reject the null hypothesis: there is no evidence that either variable on its own affects the final result.

Section 7 CORRELATION

Correlation in R
Create your data file. Use a spreadsheet and make each column a variable. Each row is a replicate. The first row should contain the variable names. Save this as a .CSV file (R_Correlation.csv).
Ref: http://www.gardenersown.co.uk/Education/Lectures/R/index.htm

Read the data into R and save it under some name.

Make the variables within the data accessible to R.

Decide on the method, run the correlation, and assign the result to a new variable. The methods are "pearson" (default), "kendall" and "spearman".

Perform a pairwise correlation on all the variables in the data set. Decide on the method ("pearson" (default), "kendall" or "spearman").

To evaluate the statistical significance of your correlation, decide on the appropriate method (pearson is the default), assign a variable, and run the test.

Have a look at the result of your significance test.

Plot a graph of the two variables from your correlation. pch=21 plots an open circle, pch=19 plots a solid circle. Try other values. Add a line of best fit (if appropriate).

Graph:

Section 8 REGRESSION

Correlation and Linear Regression
We are interested in studying the relationship among variables to determine whether they are associated with one another. Changes in one variable, x, may explain or cause changes in another variable, y. Here x is called the explanatory variable and y the response variable. If the plot looks like a straight line, it is a linear relationship.
The relationship is strong if all the data points approximately make up a straight line, and weak if the points are widely scattered about the line.

The covariance and correlation are measures of the strength and direction of a linear relationship between two quantitative variables; in R they are computed with cov() and cor(). A regression line is a mathematical model for describing a linear relationship between an explanatory variable, x, and a response variable, y. It can be used to predict the value of y for a given value of x.

Covariance and correlation
Example: A pediatrician wants to study the relationship between a child’s height and their head circumference (both measured in inches). She selects a simple random sample (SRS) of 11 three-year-old children and obtains the following data.

The variances of Height and Circ are 1.198 and 0.048, respectively. The covariance between Height and Circ is 0.219, indicating a positive relationship. The correlation between Height and Circ is 0.911. Hence, there exists a strong positive linear relationship between the variables.

Linear Regression
If there exists a strong linear relationship between two variables, it is often useful to model the relationship using a regression line, fitted with lm(response~explanatory).
Circ=12.493+0.183Height
Y=12.493+0.183X
So a one-inch increase in height is associated with a 0.183-inch increase in head circumference.

The next step is to verify all the relevant model assumptions needed for using the simple linear regression model. The residuals should be normally distributed with equal variance for every value of x.

The plot shows no apparent pattern in the residuals, indicating no clear violations of any model assumptions.

To check the normality assumption, we make a QQ-plot or a histogram.

Linear Regression
Next step is to do inference.
We want to construct tests and confidence intervals for the slope and intercept, confidence intervals for the mean response, and prediction intervals for future observations.
To test whether the slope is significantly different from 0:
We can reject the null hypothesis of no linear relationship between height and circ (p=9.59e-05).

R-squared
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
R-squared is the percentage of the response variable variation that is explained by a linear model:
R-squared=explained variation/total variation (0-100%)
0% means that the model explains none of the variability of the response data around its mean.
100% means that the model explains all the variability of the response data around its mean.
So the higher the R-squared, the better the model fits your data.
R-squared=0.83, indicating that 83% of the variability in the response is explained by the explanatory variable.

Linear Regression
Now we can use the regression equation to predict future values of the response variable. The predicted value of head circumference for a child of a given height has two interpretations:
It can represent the mean circumference for all children whose height is x.
It can represent the predicted circumference for a randomly selected child whose height is x.
The predicted value will be the same in both cases, but the standard error will be larger in the second case due to the additional variation of individual responses about the mean.

To obtain a 95% confidence interval for the mean head circumference of children who are 25 inches tall:
The confidence interval lies in the range (16.95, 17.17).
To obtain a 95% prediction interval for the head circumference of an individual child who is 25 inches tall:
This interval lies in the range (16.81, 17.30). Because it concerns a single child rather than the mean, it is a prediction interval and is wider than the confidence interval for the mean.

Multiple Linear Regression
We want to know how X1 is related to X2, X3, and X4.
X1 = first-year box office receipts (millions)
X2 = total production costs (millions)
X3 = total promotional costs (millions)
X4 = total book sales (millions)

First create your data file. Use a spreadsheet and make each column a variable. Each row is a replicate. The first row should contain the variable names. Save this as a .CSV file (R_Regression.csv). The data (X1, X2, X3, X4) are recorded for each movie.

Read the data into R and save it under some name.
Make the variables within the data accessible to R.

Have a first look at the data as a pairs graph (this plots all combinations as scatter plots).

Decide on the model, run it, and assign the result to a new variable.
See the basic coefficients of your regression.

A more detailed summary of your regression:
Overall p=7.913e-05, so the model as a whole is highly significant. X2 and X3 are significant, but X4 is not.

Once you have the basic information, you can go ahead and examine more components of your regression model.
Examine an individual coefficient.
Beta coefficients are standardized against one another to show the relative strengths of the predictors. A beta coefficient is determined as beta = b * sd(x) / sd(y), where b is the ordinary regression coefficient of x.
Calculate the beta coefficients (you will need to do one for each x factor).
Display all your beta coefficients.

The R-squared value tells us how strong the fit is (the proportion of the explained variance). However, R only shows the value for the overall model.
We can find the individual R-squared values once we know the beta coefficients:
R2=beta*correlation(X,Y)
Calculate the R-squared value for each component.
Display all your R-squared values.

Plot a graph of two variables from your regression.

Add a line of best fit.

Transformations
In many situations there exists a non-linear relationship between the variables. This can sometimes be remedied by applying a suitable transformation of the variables, such as power transformations or logarithms.

Example: Data were collected on the number of academic journals published on the Internet during the period 1991-1997.

Clearly, there is a non-linear relationship between year and journals. Taking the logarithm of the number of journals may be appropriate before fitting a simple linear regression model.

The residual plot shows no apparent pattern. This indicates that a simple linear regression model is appropriate for the transformed data.

Now we predict the number of journals published in 1998 (x=8).

Section 9 CHI-SQUARE

Chi-Squared in R
In an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We want to determine whether gender is related to voting preference. We use the chi-square test for independence.

When to use the chi-square test for independence:
The sampling method is simple random sampling.
Each population is at least 10 times as large as its respective sample.
The variables under study are each categorical.

Four steps: state the hypotheses, formulate an analysis plan, analyze the sample data, and interpret the results.

Example: A public opinion poll surveyed a simple random sample of 1000 voters. Respondents were classified by gender (male or female) and by voting preference (Republican, Democrat, or Independent).
The results are shown below. Is there a gender gap? Do the men’s voting preferences differ significantly from the women’s preferences? Use a 0.05 level of significance.

State the hypotheses: the null hypothesis is that gender and voting preference are independent; the alternative is that they are not.

Read in your file and assign it to a variable name. The command tells R that the 1st column contains the row names.
Run the Chi-Squared test and assign the result to a variable.
Apply the Yates correction if you need it for a 2 x 2 matrix.

So we reject the null hypothesis (p<0.05): men’s voting preferences differ significantly from women’s preferences.

To see the original data, i.e., the observed values
To see the expected values
To see the Pearson residuals, (O-E)/sqrt(E)

Section 10 CLUSTERING

K-means clustering in R
The most basic k-means algorithm works as follows:
1. Pick an initial set of K centroids (randomly or by any other means).
2. Assign each data point to the closest centroid according to the given distance function.
3. Move each centroid to the mean of all the data points assigned to it.
4. Go back to step 2 until the membership no longer changes and the centroid positions are stable.
5. Output the centroids.

Prepare data: Iris.csv
Data introduction: http://en.wikipedia.org/wiki/Iris_flower_data_set
The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.
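As a minimal sketch, the algorithm can be tried with the kmeans() function from base R's stats package. One assumption to note: the slides read the data from Iris.csv, whereas this sketch uses R's built-in `iris` data frame so that it runs without any external file. The seed value and `nstart = 10` are arbitrary choices for reproducibility and stability.

```r
set.seed(42)                          # k-means starts from random centroids

# Use the first two columns (Sepal.Length, Sepal.Width) and ask for 3 clusters
x  <- iris[, 1:2]
km <- kmeans(x, centers = 3, nstart = 10)

km$centers                            # final centroid positions
table(km$cluster, iris$Species)       # cluster membership vs. actual species

# Plot the points coloured by cluster and mark the centroids
plot(x, col = km$cluster, pch = 19)
points(km$centers, col = 1:3, pch = 8, cex = 2)
```

The cross-table shows how well the unsupervised clusters line up with the three species; with only the sepal measurements the match is usually imperfect.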

K-means clustering in R
Read the data into R.

K-means clustering: use columns 1-2 for clustering, with 3 classes.

Plot the result.
https://stat.ethz.ch/R-manual/Rpatched/library/graphics/html/points.html

Section 11 R GRAPHICS

Built-in Datasets
Trees dataset: provides the measurements of the girth, height, and volume of timber in 31 felled black cherry trees.

Graphics

Box plot

Good R graphics Tutorial
http://teachpress.environmentalinformaticsmarburg.de/2013/07/creating-publicationquality-graphs-in-r-7/
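The plots from the graphics slides are not reproduced in this text. As a rough sketch of what base R graphics on the built-in `trees` dataset can look like, the following draws a scatter plot, a box plot and a histogram. The titles, axis labels and choice of variables are illustrative assumptions, not the slides' originals.

```r
# Scatter plot of timber volume against girth
plot(trees$Girth, trees$Volume,
     xlab = "Girth (inches)", ylab = "Volume (cubic feet)",
     main = "Black cherry trees", pch = 19)

# Box plot of the three measurements side by side
bp <- boxplot(trees, main = "Trees dataset")

# Histogram of tree heights
hist(trees$Height, xlab = "Height (feet)", main = "Tree heights")
```

When run with Rscript rather than in an interactive session, the plots are written to a file named Rplots.pdf in the working directory.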