? searches the help for a certain command; ?? searches help pages that contain a term (e.g. ??log); # starts a comment.
"Open" button: open existing script files
"To Source" button: transfer commands from the history window to the script

age <- c(25, 56, 65, 32, 41, 49)  # generate a variable for age
tax.cat <- housing.df$TAX  # generate a new variable that is equal to TAX
mean(housing.df$MEDV)  # mean; alternatively with a condition: mean(df$price[df$fuel_type == "cng"])
min(housing.df$MEDV)  # minimum
max(housing.df$MEDV)  # maximum
sd(housing.df$MEDV)  # standard deviation
bank.df$Education <- factor(bank.df$Education, levels = c(1,2,3), labels = c("Undergrad", "Graduate", "Advanced/Professional"))
• Education is coded as an integer; we recode it as a factor to treat Education as categorical (R will then create dummy variables)

Load Dataset
getwd()  # get the working directory
setwd("X:/Data Analytics/R/Data")  # set the working directory
housing.df <- read.csv("WestRoxbury.csv", header = TRUE)  # load the data
• header = TRUE: the header row in the csv becomes the header in R
View(housing.df)  # open the whole dataset in a new tab
housing.df[1:10, 1]  # first 10 rows of the first column only ("1 to 10 in column 1")
housing.df[1:10, ]  # first 10 rows of all columns ("1 to 10 of all variables")
housing.df[5, 1:10]  # fifth row of the first 10 columns
housing.df$TOTAL.VALUE  # show the whole first column
housing.df$TOTAL.VALUE[1:10]  # show the first 10 rows of the variable TOTAL.VALUE
mean(housing.df$TOTAL.VALUE)  # mean of the first column (variable TOTAL.VALUE)
summary(housing.df)  # summary statistics for each column/all variables, e.g. to detect outliers
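The indexing patterns above can be tried on a small, self-contained example; the data frame and its values below are invented for illustration (the real WestRoxbury data has more columns):

```r
# Invented stand-in for housing.df
housing <- data.frame(TOTAL.VALUE = c(344.2, 412.6, 330.1, 498.6, 331.5),
                      TAX         = c(4330, 5190, 4152, 6272, 4170))

housing[1:3, 1]            # first 3 rows of the first column
housing[2, ]               # second row, all columns
housing$TOTAL.VALUE[1:3]   # the same 3 values, selected by variable name
mean(housing$TOTAL.VALUE)  # mean of the column
summary(housing)           # summary statistics for every column
```

housing[1:3, 1] and housing$TOTAL.VALUE[1:3] return the same vector: row/column indexing and $-selection are interchangeable here.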
How to Sample
set.seed(4)  # fix a starting point for the "pseudo" random draw
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))
generate s:
• a random sample of the row names of the housing.df dataset
• of size 5
• oversample the rare class "houses with more than 10 rooms"
• 10: number of rooms; 0.9: probability weight for that class; 0.01: probability weight of the "else" case
• i.e. houses with more than 10 rooms are drawn with weight 0.9, all other houses with weight 0.01
housing.df[s, ]  # show the sample

Preprocessing and Cleaning the Data
names(housing.df)  # print a list of variables to the screen
colnames(housing.df)[1] <- c("TOTAL_VALUE")  # rename the first column from TOTAL.VALUE to TOTAL_VALUE
class(housing.df$REMODEL)  # see the type
levels(housing.df[, 14])  # see the levels
summary(housing.df$BEDROOMS)  # see summary statistics
housing.df$test.bedrooms <- as.numeric(housing.df$BEDROOMS + 10)  # generate a numeric variable
housing.df$test.bedrooms2 <- ifelse(housing.df$test.bedrooms > 13, "above", "below")
generate a new variable that is a recode of test.bedrooms:
• equal to "above" when test.bedrooms is bigger than 13
• equal to "below" when test.bedrooms is smaller than or equal to 13
• the syntax is ifelse(test, yes, no)
For more than two categories:
attach(housing.df)
housing.df$test.bedrooms2[BEDROOMS <= 2] <- "small"
housing.df$test.bedrooms2[BEDROOMS > 2 & BEDROOMS <= 3] <- "Middle"
housing.df$test.bedrooms2[BEDROOMS > 3] <- "big"
detach(housing.df)
housing.df$test.bedrooms2 <- NULL  # drop the variable

Generate Binary Dummy Variables
install.packages("dummies")
library(dummies)
housing.df.test <- dummy.data.frame(housing.df, sep = ".")  # dummies from a whole data frame
toyota.df.test <- dummy(toyota.df$Fuel_Type, sep = ".")  # dummies from a single variable
• sep = ".": separator character used between the variable name and the value
names(housing.df.test)  # shows the variable names of the new data frame
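If the dummies package is unavailable (it has at times been archived on CRAN), base R's model.matrix(), which this sheet also uses later for the validation set, produces the same 0/1 columns. A sketch with an invented Fuel_Type variable:

```r
# Invented stand-in for toyota.df with a categorical fuel type
toyota <- data.frame(Fuel_Type = factor(c("cng", "diesel", "petrol", "cng")))

# ~ 0 + Fuel_Type: one 0/1 column per level, without an intercept column
fuel.dummies <- as.data.frame(model.matrix(~ 0 + Fuel_Type, data = toyota))
fuel.dummies  # columns Fuel_Typecng, Fuel_Typediesel, Fuel_Typepetrol
```

Each row has exactly one 1, in the column that matches its factor level.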
Missing Data
housing.df[rows.to.missing, ]$BEDROOMS <- median(housing.df$BEDROOMS, na.rm = TRUE)
housing.df$BEDROOMS[is.na(housing.df$BEDROOMS)] <- median(housing.df$BEDROOMS, na.rm = TRUE)
• replace the missing values with the median of the remaining values
• is.na(housing.df$BEDROOMS): which values of BEDROOMS are missing?
• median() with na.rm = TRUE ignores missing values when computing

Normalising & Rescaling
centered.tax.cat <- scale(tax.cat, center = TRUE, scale = TRUE)
• normalize tax.cat: subtract the mean from each value, then divide by the standard deviation
install.packages("scales")
library(scales)
centered2.tax.cat <- rescale(tax.cat, to = c(0, 1), from = range(tax.cat, na.rm = TRUE))
• rescale each variable to a [0, 1] scale with rescale(): subtract the minimum value, then divide by the range
• to: output range
• from: input range
• na.rm = TRUE: ignore missing values when computing

Data Partitions
set.seed(1)  # to get the same partitions when re-running the R code (always put it in front of sample)
## partitioning into training (50%), validation (30%), test (20%):
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)
valid.rows <- sample(setdiff(rownames(housing.df), train.rows), dim(housing.df)[1]*0.3)
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))
Sample the training set:
• sample: sample from the housing data
• dim(housing.df)[1]: number of rows of the data frame
• 0.5: sample 50% of the rows
Sample the validation set:
• setdiff: take only row names that are not already in train.rows
• 0.3: sample 30%
Sample the test set:
• setdiff(..., union(...)): draw only records that are in neither the training nor the validation set
train.data <- housing.df[train.rows, ]
valid.data <- housing.df[valid.rows, ]
test.data <- housing.df[test.rows, ]
create the three data frames by collecting all columns from the appropriate rows (remember that the partitioned data includes both the original categorical variables and the dummies; during modeling we should not use both sets)
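The 50/30/20 partition above can be run end-to-end on a toy data frame (the 100-row df here is invented so the counts are easy to check):

```r
set.seed(1)  # fix the random draw so the partitions are reproducible

df <- data.frame(x = rnorm(100))  # invented data frame with 100 rows

n <- dim(df)[1]
train.rows <- sample(rownames(df), n * 0.5)                         # 50 rows
valid.rows <- sample(setdiff(rownames(df), train.rows), n * 0.3)    # 30 of the rest
test.rows  <- setdiff(rownames(df), union(train.rows, valid.rows))  # remaining 20

train.data <- df[train.rows, , drop = FALSE]
valid.data <- df[valid.rows, , drop = FALSE]
test.data  <- df[test.rows, , drop = FALSE]
```

Because each step samples only from row names not used before, the three sets are disjoint and together cover every row.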
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
Alternative way of creating a training & validation set:
• c(1:dim(bank.df)[1]): vector of row indices from which to choose
• dim(bank.df)[1]: number of rows of the data frame
• 0.6: select 60% of the rows
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
• train: create the data frame by collecting all columns from the sampled rows
• valid: create the data frame from all rows not in train.index (the minus sign excludes rows)

Plotting Graphs
install.packages("ggplot2")
library(ggplot2)
preliminaries for ggplots
ggplot(housing.df, aes(MEDV)) + geom_histogram(binwidth = 1, fill = "navy")
Histogram
• histogram of the housing data frame with MEDV on the x-axis
• binwidth: bar width
• fill: bar colour
ggplot(housing.df, aes(x = CHAS, y = MEDV)) + stat_boxplot(geom = "errorbar", width = 0.25) + geom_boxplot(width = 0.5, fill = "navy", alpha = 0.7)
Boxplot
• aes: x and y set the variables on the x- and y-axis
• stat_boxplot: calculates the components of the box-and-whiskers plot
• geom = "errorbar": geometric object used to display the whisker ends
• width: width of the object
• geom_boxplot: adds the box and the outliers of MEDV
• fill: colour; alpha: transparency
aes(x = as.factor(CHAS), y = MEDV)  # for a grouped boxplot
• x = as.factor(CHAS): draws one box plot per level, next to each other
ggplot(housing.df, aes(x = DIS, y = MEDV)) + geom_point(color = "navy", alpha = 0.7)
Scatterplot (two variables)
ggplot(car.df, aes(y = Price, x = HP)) + geom_point() + expand_limits(x = 0, y = 0) + stat_smooth(method = 'lm', se = FALSE)  # scatterplot with fitted regression line
ggplot(housing.df, aes(x = DIS, y = MEDV, color = as.factor(CHAS))) + geom_point(alpha = 0.7)
• multivariate: points coloured by a third variable
require(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])  # scatterplot matrix
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none", cellnote = round(cor(housing.df), 2), notecol = "black", key = FALSE, trace = "none", margins = c(10,10))
Correlation heatmap (heatmap.2 comes from the gplots package)
• cor: correlation matrix
• Rowv = FALSE, Colv = FALSE: no dendrogram is computed and no reordering of rows/columns is done
• dendrogram = "none": draw no dendrogram
• cellnote = round(cor(housing.df), 2): matrix of character strings placed within each colour cell (rounded to 2 decimal places)
• notecol = "black": colour of the cellnote text
• trace = "none": no solid trace line
• margins = c(10,10): margins for the column & row names

ggplot(housing.df, aes(x = as.factor(CHAS))) + geom_bar(width = 0.5, fill = "navy", alpha = 1)
Bar chart
• the y-axis here is automatically "count"
ggplot(housing.df, aes(x = as.factor(CHAS), y = MEDV)) + geom_bar(width = 0.5, fill = "navy", alpha = 1, stat = "summary", fun.y = "mean")
• the y-axis is MEDV
• stat = "summary", fun.y = "mean": the bar height is the mean of MEDV per group (newer ggplot2 versions use fun instead of fun.y)
ggplot(housing.df) + geom_bar(aes(x = as.factor(RAD), y = MEDV), stat = "summary", fun.y = "mean", fill = "navy", alpha = 0.8) + xlab("RAD") + facet_grid(CHAS ~ .)
• depicts side-by-side displays (multivariate relationships)
• xlab("RAD"): x-axis label
• facet_grid(CHAS ~ .): one panel per level of CHAS
ggplot(trains.df, aes(x = as.Date(Month), y = Ridership)) + geom_line() + geom_point()
Line graph

Summary Statistics
library(psych)
describe(housing.df)  # produces the most frequently requested stats of psychology studies in an easy-to-read data frame
describeFast(housing.df)  # produces the number of total cases, complete cases, …
describe(housing.df[, c(1, 13, 8)], skew = FALSE, quant = c(.25, .50, .75))
• skew = FALSE: should skew and kurtosis be calculated?
• quant = : specify the quantiles to be calculated
help(describe)

Frequency Tables
table(housing.df$CHAS)  # frequency table/counts of a variable
install.packages("summarytools")
library(summarytools)
freq(housing.df$CHAS)  # frequency table/counts of a variable, but with more information
rm(list = ls())  # removes all data from the workspace
pacman::p_load(ggplot2, forecast, leaps, Hmisc)  # load several packages at once

Multiple Regression
car.lm <- lm(Price ~ ., data = train.df)
• the "."
after ~ includes all the remaining columns as predictors
• data: the data to use
• write summary(car.lm) to get the regression output
hist(all.residuals, breaks = 25, xlab = "Residuals", main = "")
Histogram of all.residuals
• breaks = 25: about 25 bins
• xlab: x-axis label
• main: title (here empty)

Reducing the Number of Predictors: Exhaustive Search
library(leaps)  # preliminary to run an exhaustive search
train.df <- cbind(train.df[, -4], Fuel_Type[, ])  # replace the Fuel_Type column with 2 dummies (unlike with lm, categorical predictors must be turned into dummies manually)
head(train.df)  # shows the first rows of the data frame
search <- regsubsets(Price ~ ., data = train.df, nbest = 1, nvmax = dim(train.df)[2], method = "exhaustive")
• regsubsets runs the exhaustive search
• nbest = 1: keep the single best model of each size
• nvmax = dim(train.df)[2]: maximum number of predictors to consider
sum <- summary(search)  # saves the summary of the search in sum
sum$which  # shows which predictors are included in each model (TRUE = included)
sum$rsq  # show metric: R²
sum$cp  # show metric: Mallow's Cp => stop when the values increase again

## Alternative ways of reducing the number of predictors: popular subset selection algorithms
# Backward selection:
car.lm.step <- step(car.lm, direction = "backward")
to run stepwise regression, set direction to either "backward", "forward", or "both"
( selected.vars <- names(car.lm.step$model) )
• with the additional brackets, the assigned result is also printed
round(cor(train.df[, selected.vars]), 2)
• gives the correlation matrix of the selected variables, rounded to 2 decimal places
Fuel_Type_val <- as.data.frame(model.matrix(~ 0 + Fuel_Type, data = valid.df))
valid.df <- cbind(valid.df[, -4], Fuel_Type_val[, ])
alternative way of creating dummy variables

Exercise Sheet 6 and Chapter 10
colnames(df) <- tolower(colnames(df))  # write all column names of the data frame in lower case
levels(df$fuel_type) <- tolower(levels(df$fuel_type))  # same for the factor levels
df <- subset(df, select = c("price", "fuel_type", "km", "hp"))  # only keep these four variables
create a subset of the data frame; since the result is again called df, the original data frame is overwritten
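The cleanup steps above (lower-casing names and levels, then keeping a subset of columns) can be sketched on an invented two-row car data frame:

```r
# Invented car data frame
df <- data.frame(Price = c(9500, 13750),
                 Fuel_Type = factor(c("CNG", "Diesel")),
                 KM = c(46986, 72937),
                 HP = c(90, 110),
                 ID = 1:2)

colnames(df) <- tolower(colnames(df))                   # column names to lower case
levels(df$fuel_type) <- tolower(levels(df$fuel_type))   # factor levels to lower case
df <- subset(df, select = c("price", "fuel_type", "km", "hp"))  # keep four columns; df is overwritten
names(df)
```

After the subset() call the id column is gone, because only the four selected columns survive.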
df <- subset(df, select = -c(id, zip.code))  # the minus sign drops columns!
df$education <- factor(df$education, levels = c(1, 2, 3), labels = c("_undergrad", "_graduate", "_advanced"))
# education is coded as an integer; we recode it as a factor to treat education as categorical (R will create dummy variables)

Logistic Regression
logit.simple <- glm(personal.loan ~ income, data = train.df, family = "binomial")
summary(logit.simple)

Call:
glm(formula = personal.loan ~ income, family = "binomial", data = train.df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.201  -0.299  -0.169  -0.107   2.770

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.27275    0.24877   -25.2   <2e-16 ***
income       0.03840    0.00184    20.8   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1915.1  on 2999  degrees of freedom
Residual deviance: 1207.3  on 2998  degrees of freedom
AIC: 1211

Number of Fisher Scoring iterations: 6

str(logit.simple$coefficients)
Named num [1:2] -6.2727 0.0384
- attr(*, "names")= chr [1:2] "(Intercept)" "income"

( b0 <- logit.simple$coefficients[1] )  # (Intercept) -6.27
( b1 <- logit.simple$coefficients[2] )  # income 0.0384

p <- function(x) exp(b0 + b1*x) / (1 + exp(b0 + b1*x))  # predicted probability, computed from the odds
ggplot(train.df, aes(y = personal.loan, x = income)) + geom_point() + stat_function(fun = p) + xlim(0, 250)  # data points with the fitted logistic curve
logit.simple.pred <- predict(logit.simple, valid.df, type = "response")  # predicted probabilities for the validation set
classifications <- as.factor(ifelse(logit.simple.pred > 0.5, 1, 0))  # classify with a 0.5 cutoff
confusionMatrix(classifications, as.factor(valid.df$personal.loan))  # confusion matrix (caret package)
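The whole logistic-regression workflow can be rehearsed on simulated data. Everything below is invented for the sketch (the bank data itself is not reproduced here), and since confusionMatrix() requires the caret package, base table() is used instead:

```r
set.seed(1)

# Simulated stand-in for the bank data: loan probability rises with income
n <- 1000
income <- runif(n, 0, 250)
p.true <- 1 / (1 + exp(-(-6 + 0.04 * income)))
personal.loan <- rbinom(n, 1, p.true)
bank <- data.frame(personal.loan, income)

train.df <- bank[1:700, ]    # simple 70/30 split for the sketch
valid.df <- bank[701:1000, ]

logit.simple <- glm(personal.loan ~ income, data = train.df, family = "binomial")

# Predicted probabilities on the validation set, classified with a 0.5 cutoff
logit.simple.pred <- predict(logit.simple, valid.df, type = "response")
classifications <- ifelse(logit.simple.pred > 0.5, 1, 0)

# Raw confusion matrix with base R (caret's confusionMatrix() adds accuracy etc.)
table(predicted = classifications, actual = valid.df$personal.loan)
```

type = "response" is what turns the linear predictor b0 + b1*income into a probability between 0 and 1.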