Uploaded by Maria Hart

Data Analytics R Code

Code
Explanation
?
search help for a certain command (e.g. ?mean)
??
search help pages that contain a term (e.g. ??log searches help containing "log")
#
comment out a line in the script
"Open" button
open existing script files
"To Source" button
transfer commands from your History window to the script
age <- c(25, 56, 65, 32, 41, 49)
Generate a variable for age
tax.cat <- housing.df$TAX
generate new variable that is equal to TAX
mean(housing.df$MEDV)
altern.: mean(df$price[df$fuel_type == "cng"])
min(housing.df$MEDV)
max(housing.df$MEDV)
sd(housing.df$MEDV)
standard deviation
sd(housing.df$MEDV)/sqrt(length(housing.df$MEDV))
standard error (base R has no built-in se() function; compute it from sd)
bank.df$Education <- factor(bank.df$Education, levels = c(1,2,3),
labels = c("Undergrad", "Graduate", "Advanced/Professional"))
• Education is coded as integer, we want to recode it as factor to
• treat Education as categorical (R will create dummy variables)
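The recoding above can be sketched on a toy vector (edu is a made-up example, not the bank data):

```r
# Toy example (assumed data): recode integer codes as a labelled factor
edu <- c(1, 3, 2, 1)
edu <- factor(edu, levels = c(1, 2, 3),
              labels = c("Undergrad", "Graduate", "Advanced/Professional"))
levels(edu)      # the three labels
as.integer(edu)  # the underlying codes 1 3 2 1 are preserved
```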
Load Dataset
getwd()
Get working directory
setwd("X:/Data Analytics/R/Data")
Set working directory
housing.df <- read.csv("WestRoxbury.csv", header = TRUE)
load data
• header = TRUE: the first row of the CSV file contains the variable names
View(housing.df)
Open whole dataset in a new tab
housing.df[1:10, 1]
# show sample
• first 10 rows of the first column only ("1 to 10 in column 1")
housing.df[1:10, ]
• first 10 rows of each of the columns ("1 to 10 of all variables")
housing.df[5, 1:10]
• fifth row of the first 10 columns
housing.df$TOTAL.VALUE
show the whole first column
housing.df$TOTAL.VALUE[1:10]
show first 10 rows of the first column (of the variable
TOTAL.VALUE)
mean(housing.df$TOTAL.VALUE)
get the mean of the first column (= variable: total value)
summary(housing.df)
summary statistics for each column/ of all variables
e.g. to detect outliers
How to sample
set.seed(4)
fix a starting point for the "pseudo" random draw
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS>10,
0.9, 0.01))
generate "s"
• that is equal to a random sample
• with the row names of our housing.df dataset
• Has 5 rows
• overweight the rare class "houses with more than 10 rooms"
• 10: # of rooms
• 0.9: probability
• 0.01: probability of "else"
draw houses with more than 10 rooms with a probability of 0.9 &
other houses with a probability of 0.01
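A minimal self-contained sketch of this weighted draw, using a made-up rooms vector in place of housing.df:

```r
# Toy example (assumed data): oversample the rare class "more than 10 rooms"
set.seed(4)
rooms <- c(4, 5, 12, 6, 11, 5, 13, 4)
s <- sample(seq_along(rooms), 3,
            prob = ifelse(rooms > 10, 0.9, 0.01))  # 0.9 vs. 0.01, as above
rooms[s]  # mostly houses with more than 10 rooms
```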
housing.df[s, ]
Show sample
Preprocessing and Cleaning the Data
names(housing.df)
print a list of variables to the screen
colnames(housing.df)[1] <- c("TOTAL_VALUE")
change the first column's name from TOTAL.VALUE to TOTAL_VALUE
class(housing.df$REMODEL)
See type
levels(housing.df[, 14])
See levels
summary(housing.df$BEDROOMS)
See summary statistics
housing.df$test.bedrooms <- as.numeric(housing.df$BEDROOMS + 10)
generate a new numeric variable equal to BEDROOMS + 10
housing.df$test.bedrooms2 <- ifelse(housing.df$test.bedrooms > 13,
c("above"), c("below"))
generate a new variable that is the recode of test.bedrooms
• it is equal to "above" when test.bedrooms is bigger than 13
• equal to "below" when test.bedrooms is smaller than or equal to 13
• the syntax is ifelse(test, yes, no)
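The ifelse(test, yes, no) syntax can be checked on a toy vector (beds is a made-up example):

```r
# Toy example: "above" when the value exceeds 13, "below" otherwise
beds <- c(12, 15, 13, 14)
ifelse(beds > 13, "above", "below")  # "below" "above" "below" "above"
```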
attach(housing.df)
housing.df$test.bedrooms2[BEDROOMS > 1] <- "small"
housing.df$test.bedrooms2[BEDROOMS > 2 & BEDROOMS <= 3] <- "Middle"
housing.df$test.bedrooms2[BEDROOMS > 3] <- "big"
detach(housing.df)
for more than two categories:
housing.df$test.bedrooms2 <- NULL
Drop variable (assigning NULL deletes the column)
Generate Binary Dummy Variables
install.packages("dummies")
library(dummies)
• housing.df.test <- dummy.data.frame(housing.df, sep = ".")
• toyota.df.test <- dummy(toyota.df$Fuel_Type, sep = ".")
• generate dummies
• from a data frame (data frame = housing.df)
• from a single variable or variable name (= toyota.df$Fuel_Type)
• sep = ".": separator character used between variable name & value
names(housing.df.test)
shows the column names of the data frame
Missing Data
• housing.df[rows.to.missing,]$BEDROOMS <- median(housing.df$BEDROOMS,
na.rm = TRUE)
• housing.df$BEDROOMS[is.na(housing.df$BEDROOMS)] <- median(housing.df$BEDROOMS, na.rm = TRUE)
• replace the missing values using the median of the remaining
values
• is.na(housing.df$BEDROOMS): is any data in BEDROOMS missing?
• use median() with na.rm = TRUE to ignore missing values when
computing
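Median imputation as described above, sketched on a toy vector with one missing value:

```r
# Toy example: replace NA with the median of the non-missing values
beds <- c(2, 3, NA, 4, 3)
beds[is.na(beds)] <- median(beds, na.rm = TRUE)  # median of 2, 3, 4, 3 is 3
beds  # 2 3 3 4 3
```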
Normalising & Rescaling
centered.tax.cat <- scale(tax.cat, center=TRUE, scale=TRUE)
normalize tax.cat (=x)
(subtract mean from each value & then divide by the standard
deviation)
install.packages("scales")
library(scales)
rescaling each variable to a [0,1] scale with rescale()
(we subtract the minimum value & then divide by the range.)
• to: output range
• from: input range
• na.rm = TRUE: ignore missing values when computing
centered2.tax.cat <- rescale(tax.cat, to = c(0, 1), from = range(tax.cat,
na.rm = TRUE))
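The same [0,1] rescaling can be done by hand without the scales package (x is a made-up vector): subtract the minimum, then divide by the range.

```r
# Toy example: manual equivalent of rescale(x, to = c(0, 1))
x <- c(10, 20, 40, 30)
(x - min(x, na.rm = TRUE)) / diff(range(x, na.rm = TRUE))
```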
Data Partitions
set.seed(1)
to get the same partitions when re-running the R code (always in
front of sample)
## partitioning into training (50%), validation (30%), test (20%):
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
dim(housing.df)[1]*0.3)
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))
Sample Training Set
• sample: sample from the housing data
• dim(housing.df)[1]: number of rows of the data frame
• [1]: first element of dim() = number of rows
• 0.5: sample 50% of the rows
Sample Validation Set
• setdiff: take different observations/row names than in train.rows
• 0.3: sample 30%
Sample Test Set
• setdiff(.., union()): draw only from records not already in the
train & valid set
train.data <- housing.df[train.rows, ]
valid.data <- housing.df[valid.rows, ]
test.data <- housing.df[test.rows, ]
create the 3 data frames by collecting all columns from the
appropriate rows
(remember that the partitioned data includes both the original
categorical variables and the dummies, and during modeling we
should not use both sets)
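The 50/30/20 partition above can be sanity-checked on toy row names; the three sets should be disjoint and together cover all rows:

```r
# Toy example (100 made-up row names) of the 50/30/20 partition
set.seed(1)
rn <- as.character(1:100)
train.rows <- sample(rn, length(rn) * 0.5)
valid.rows <- sample(setdiff(rn, train.rows), length(rn) * 0.3)
test.rows  <- setdiff(rn, union(train.rows, valid.rows))
c(length(train.rows), length(valid.rows), length(test.rows))  # 50 30 20
```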
train.index <- sample(c(1:dim(bank.df)[1]), dim(bank.df)[1]*0.6)
Alternative way of creating training & validation set
• c(1:dim(bank.df)[1]): vector of row indices to choose from
• dim(bank.df)[1]: number of rows of the data frame
• 0.6: select 60% of the data from the data frame
train.df <- bank.df[train.index, ]
valid.df <- bank.df[-train.index, ]
• train: Create data frame by collecting all columns from the
appropriate rows
• valid: Create data frame by excluding all columns from the
train.index rows
Plotting Graphs
install.packages("ggplot2")
library(ggplot2)
Preliminary for ggplots
ggplot(housing.df, aes(MEDV))
+ geom_histogram(binwidth=1, fill = "navy")
Histogram
• histogram of the housing data frame with MEDV on the x-axis
• binwidth: bar width
• fill: bar colour
ggplot(housing.df, aes(x=CHAS, y=MEDV))
+ stat_boxplot(geom = "errorbar", width = 0.25)
+ geom_boxplot(width=0.5, fill = "navy", alpha=0.7)
Boxplot
• aes:
• x: x-axis name
• y: y-axis name
• stat_boxplot: calculate components of the box & whiskers plot
• geom: geometric object used to display data (= errorbar)
• width: width of the object
• geom_boxplot: adds box & outliers for MEDV
• width: width
• fill: colour
• alpha: transparency (0 = transparent, 1 = opaque)
aes(x = as.factor(CHAS), y=MEDV)
for a grouped boxplot
• x = as.factor(CHAS): two box plots next to each other
ggplot(housing.df, aes(x=DIS, y=MEDV))
+ geom_point(color="navy", alpha=0.7)
Scatterplot
• univariate
ggplot(car.df, aes(y=Price, x=HP))
+ geom_point()
+ expand_limits(x = 0, y = 0)
+ stat_smooth(method = 'lm', se = FALSE)
ggplot(housing.df, aes(x=DIS, y=MEDV, color=as.factor(CHAS)))
+ geom_point(alpha=0.7)
• multivariate
require(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])
Scatterplot Matrix
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram =
"none", cellnote = round(cor(housing.df),2), notecol = "black", key =
FALSE, trace = "none", margins = c(10,10))
Correlation Heatmap (heatmap.2 is in the gplots package)
• cor: correlation
• Rowv = FALSE, Colv = FALSE: no dendrogram is computed & no
reordering is done
• dendrogram = "none": indicates to draw no dendrogram
• cellnote = round(cor(housing.df),2): matrix of character strings
which will be placed within each colour cell (rounded to 2 decimal
places)
• notecol = "black": colour of cellnote text
• trace = "none": no solid trace line
• margins = c(10,10): margins for column & row names
ggplot(housing.df, aes(x=as.factor(CHAS)))
+ geom_bar(width=0.5, fill = "navy", alpha=1)
ggplot(housing.df, aes(x=as.factor(CHAS), y=MEDV))
+ geom_bar(width=0.5, fill = "navy", alpha=1, stat = "summary",
fun.y = "mean")
ggplot(housing.df)
+ geom_bar(aes(x = as.factor(RAD), y = MEDV), stat = "summary",
fun.y = "mean", fill = "navy", alpha=0.8)
+ xlab("RAD") + facet_grid(CHAS ~ .)
ggplot(trains.df, aes(x=as.Date(Month), y=Ridership))
+ geom_line()
+ geom_point()
Barchart
• the y-axis here is automatically "count"
• the y-axis is MEDV
• stat = "summary": plot a summary statistic instead of counts
• fun.y = "mean": the summary statistic is the mean
• depicting side-by-side displays (multivariate relationships)
• xlab("RAD"): x-axis label
• facet_grid(CHAS ~ .): one panel per level of CHAS
Line Graphs
Summary Statistics Calculated
library(psych)
describe(housing.df)
describeFast(housing.df)
describe(housing.df[,c(1,13,8)], skew=FALSE, quant=c(.25, .50, .75))
produce the most frequently used/requested stats of psychology studies
in an easy-to-read data frame
• describeFast: produces # of total cases, complete cases, ….
• skew = FALSE: should skew & kurtosis be calculated?
• quant = c(.25, .50, .75): specify the quantiles to be calculated
help(describe)
Frequency tables
table(housing.df$CHAS)
Frequency tables/ counts of variables
install.packages("summarytools")
library(summarytools)
freq(housing.df$CHAS)
Frequency tables/ counts of variables, but more infos
rm(list=ls())
removes all objects from the workspace
pacman::p_load(ggplot2, forecast, leaps, Hmisc)
p_load: install (if necessary) & load several packages in one call
Multiple Regression
car.lm <- lm(Price ~ ., data = train.df)
Multiple Regression
• . after ~: to include all the remaining columns as predictors
• data: data to use
• write summary(car.lm) to get the regression output
hist(all.residuals, breaks = 25, xlab = "Residuals", main = "")
Histogram
• of all.residuals
• breaks = 25: approximate number of bins
• xlab: x-axis label
• main: title (empty here)
Reducing the Number of Predictors: Exhaustive Search
library(leaps)
Preliminary to run an exhaustive search
train.df <- cbind(train.df[,-4], Fuel_Type[,])
replace Fuel_Type column with 2 dummies (unlike with lm,
categorical predictors must be turned into dummies manually)
head(train.df)
Gives header of the data frame
search <- regsubsets(Price ~ ., data = train.df, nbest = 1, nvmax =
dim(train.df)[2], method = "exhaustive")
• regsubsets is the function for the exhaustive search
• nbest = 1: number of best models reported for each model size
• nvmax = dim(train.df)[2]: maximum number of predictors to consider
sum <- summary(search)
saves the summary of the search in "sum"
sum$which
shows which predictors should be used (the more TRUE the better)
sum$rsq
show metrics: R2
sum$cp
show metrics: Mallow's Cp => stop when values increase again
## Alternative Ways of Reducing the Number of Predictors: Popular Subset Selection Algorithms
# Backward selection:
car.lm.step <- step(car.lm, direction = "backward")
to run stepwise regression
set direction = to either "backward", "forward", or "both"
( selected.vars <- names(car.lm.step$model) )
• the extra parentheses make the assignment also print the result
round( cor(train.df[, selected.vars]) , 2)
• gives the correlation matrix (rounded to 2 decimal places) of the
selected variables
Fuel_Type_val <- as.data.frame(model.matrix(~ 0 + Fuel_Type,
data=valid.df))
valid.df <- cbind(valid.df[,-4], Fuel_Type_val[,])
Alternative way of creating dummy variables
Exercise Sheet 6 and Chapter 10
colnames(df) <- tolower(colnames(df))
write all column names of the data frame in lower case
levels(df$fuel_type) <- tolower(levels(df$fuel_type))
df <- subset(df, select=c("price", "fuel_type", "km", "hp")) #only keep
these four variables
create a subset of the data frame; since it is also called df, the
previous df is now replaced
df <- subset(df, select = -c(id, zip.code))
#the minus-sign drops columns!
df$education <- factor(df$education, levels = c(1, 2, 3), labels =
c("_undergrad", "_graduate", "_advanced"))
education is coded as integer, we want to recode it as a factor
# treat education as categorical (R will create dummy variables)
Logistic Regression
logit.simple <- glm(personal.loan ~ income, data = train.df, family =
"binomial")
summary(logit.simple)
Call:
glm(formula = personal.loan ~ income, family = "binomial", data = train.df)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.201  -0.299  -0.169  -0.107   2.770

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.27275    0.24877   -25.2   <2e-16 ***
income       0.03840    0.00184    20.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1915.1  on 2999  degrees of freedom
Residual deviance: 1207.3  on 2998  degrees of freedom
AIC: 1211

Number of Fisher Scoring iterations: 6
str(logit.simple$coefficients)
Named num [1:2] -6.2727 0.0384
- attr(*, "names")= chr [1:2] "(Intercept)" "income"
( b0 <- logit.simple$coefficients[1] )
( b1 <- logit.simple$coefficients[2] )
> ( b0 <- logit.simple$coefficients[1] )
(Intercept)
-6.27
> ( b1 <- logit.simple$coefficients[2] )
income
0.0384
p <- function(x) exp(b0 + b1*x) / (1 + exp(b0 + b1*x))
predicted probability (logistic response function); the odds are exp(b0 + b1*x)
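Plugging in the coefficient estimates reported in the summary output above, the response function gives a predicted probability; the intercept and slope below are the values from that output:

```r
# Coefficients taken from the summary(logit.simple) output above
b0 <- -6.27275
b1 <- 0.03840
p <- function(x) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
p(100)              # predicted probability of a loan at income = 100 (~ 0.08)
exp(b0 + b1 * 100)  # the corresponding odds
```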
ggplot(train.df, aes(y= personal.loan, x=income)) + geom_point() +
stat_function(fun = p) + xlim(0,250)
logit.simple.pred <- predict(logit.simple, valid.df, type = "response")
classifications <- as.factor(ifelse(logit.simple.pred > 0.5, 1, 0))
confusionMatrix(classifications, as.factor(valid.df$personal.loan))
• type = "response": return predicted probabilities instead of logits
• cutoff 0.5: classify as 1 when the predicted probability exceeds 0.5
• confusionMatrix (caret package): compare predicted & actual classes