---
title: "Stats 230 Final Project"
author: ''
date: "4/19/2020"
output:
  pdf_document: default
  html_document: default
out.height: 20%
out.width: 20%
fig_width: 2
fig_height: 1
---

```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```

```{r setup, echo = F, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(car)
library(leaps)
library(lubridate)
library(rvest)
library(olsrr)
library(corrplot)
source("http://www.reuningscherer.net/s&ds230/Rfuncs/regJDRS.txt")
```

# *Presidential Data Analysis*

![](https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/gettyimages-515392598-master-1503595848.jpg)

### *Introduction*

During the 2020 election year, it seems appropriate to take a look at presidential data, from George Washington to Donald Trump, to look for patterns over time that could help us understand what makes a president. A wide range of variables, covering physical attributes, policy effectiveness, and political and educational experience, will be studied and searched for significant correlations. One of the main goals of this project was to create a readable data set covering all of the presidents (which does not exist as of now).

### *Data*

The data frame is 22 variables wide and 67 rows deep. While there have been 44 presidents, there have been 58 elections and 67 transitions of power. Because of death, assassination, and resignation, presidents come and go in non-election years. To handle this messy distribution, I found it simpler to count each transition of power as a separate row. For example, George Washington's first term counts as the first row, while his second term counts as the second. Abraham Lincoln takes up two rows because of his two elections, but the succession of Andrew Johnson after Lincoln's assassination gives Johnson his own row.
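As a quick sanity check on the row count, the 67 rows can be recovered by adding the mid-term successions to the 58 elections. The sketch below is illustrative only: the count of nine successions is read off the nine names appended in the cleaning code (John Tyler through Gerald Ford), and the variable names are made up for this example.

```r
# Illustrative arithmetic only; these variable names are hypothetical.
n_elections   <- 58  # presidential elections covered by the data
n_successions <- 9   # mid-term successions appended during cleaning (Tyler ... Ford)
n_rows <- n_elections + n_successions
n_rows
```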
### *Variables*

* ElectoralCollege: Proportion of electoral votes won by the president in the election.
* AgeatElection: The age of the president at their election, in years.
* Inaug: The number of words in the president's inaugural address.
* ExecOrder: Average number of executive orders per year in office. NAs appear where the value is already recorded on another row for the same president. For example, George Washington takes up two rows; over the eight years of his presidency he averaged 1 executive order a year, and this is recorded only once.
* SupremeCourt: Confirmed Supreme Court nominations over the course of the presidency.
* TotalJ: Total confirmed judicial appointees over the course of the presidency.
* VetoSuccess: Total number of vetoes not overridden by Congress over the course of the presidency.
* Educ: Highest level of education achieved.
* PreviousExp: Highest level of political experience achieved before the presidency.
* Height: In feet and inches. 6.2 reads as 6 feet 2 inches.
* NetWorth: Net worth in millions. 1 represents less than 1 million.
* ReRan: 0 = did not run for reelection, 1 = ran for reelection.
* ReWon: 0 = did not win reelection, 1 = won reelection.

### *Data Cleaning Process*

Every variable in the Presidents data set was web scraped and cleaned according to its unique characteristics. I used sources such as Senate.gov, potus.com, and various Wikipedia pages to accumulate all of these data points into one data frame. While each variable had a different process, I used for loops to append to most of my raw data so that it fit my 67 rows. If you search for presidential data, you will get 44 or 45 results for any given variable, but I had to split all of this data between terms and introduce incidental presidencies at specific points. I encountered many issues in aligning my NAs so that they corresponded correctly with gaps in presidential terms and with the reelections of presidents.
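The row-alignment idea can be sketched on a toy vector. This is a minimal illustration, not the project's actual data: the values and the `insert_after` positions are hypothetical, and the only real function used is base R's `append()`. Note that each position refers to the vector as it grows, which is why the order of the positions matters.

```r
# Hypothetical toy data: one scraped value per president.
per_president <- c(10, 20, 30)

# Positions (in the growing vector) after which an extra row is needed.
insert_after <- c(1, 3)

per_term <- per_president
for (i in insert_after) {
  # append() inserts a placeholder NA right after position i.
  per_term <- append(per_term, NA, after = i)
}
per_term
```

Working through the loop by hand: after the first pass the vector is `10, NA, 20, 30`, so position 3 in the second pass refers to the value 20, not 30.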
I also ran into trouble web scraping from so many different sources, so I had to use my selector tool very sparingly and carefully. Some of the functions I used to prepare this data frame were gsub, for loops, append, trimws, as.numeric, round, rev, and unname.

```{r,echo=FALSE}
#Election Year Variable
url1 <- "https://www.infoplease.com/us/government/elections/presidentialelections-1789-2016"
webpage1 <- read_html(url1)
ElectionYear <- html_text(html_nodes(webpage1,'.sgmltable b'))
#(ElectionYear)
# Insert an NA row for each mid-term succession
for ( i in c(14,17,22,27,33,39,46,51,55)) {
  ElectionYear <- append(ElectionYear, NA, after = i)
}

#Electoral College Percentage Win
url2 <- "https://en.wikipedia.org/wiki/United_States_presidential_election"
webpage2 <- read_html(url2)
ElectionR <- html_text(html_nodes(webpage2,'tr:nth-child(18) .nowrap , tr:nth-child(13) .nowrap , td:nth-child(7) .nowrap'))
ElectionR <- gsub("/", "", ElectionR)
ElectionR <- gsub("\\s+", " ", ElectionR)
# Manual corrections to the scraped "won total" pairs
ElectionR <- gsub("69 138", "069 69", ElectionR)
ElectionR <- gsub("1 138", "132 132", ElectionR)
ElectionR <- gsub("1 264", "071 138", ElectionR)
ElectionR <- gsub("73 276", "073 138", ElectionR)
ElectionR <- gsub("162 176", "162 176", ElectionR)
ElectionR <- gsub("113 176", "122 175", ElectionR)
ElectionR <- gsub("128 217", "128 217", ElectionR)
ElectionR <- gsub("183 217", "183 217", ElectionR)
ElectionR <- gsub("73 276", "231 232", ElectionR)
ElectionR <- gsub("218 232", "084 261", ElectionR)
ElectionR <- gsub("74 261", "178 261", ElectionR)
ElectionR <- gsub("171 261", "219 286", ElectionR)
ElectionR <- gsub("189 286", "219 286", ElectionR)
ElectionR <- gsub("147 294", "170 294", ElectionR)
# Votes won and total votes sit in fixed-width positions
Col1 <- substr(ElectionR, 1, 3)
Col2 <- substr(ElectionR, 5, 7)
Col1 <- as.numeric(Col1)
Col2 <- as.numeric(Col2)
ElectoralCollege <- round(Col1/Col2, 2)
for ( i in c(14,17,22,27,33,39,46,51,55)) {
  ElectoralCollege <- append(ElectoralCollege, NA, after = i)
}

#President Name by Election Year
url3 <- "https://en.wikipedia.org/wiki/United_States_presidential_election"
webpage3 <- read_html(url3)
Name <- html_text(html_nodes(webpage3,'.vcard:nth-child(2) .fn'))
Name <- gsub("/.*", "", Name)
Name <- gsub("(incumbent)", "", Name)
Name <- gsub("\\()", "", Name)
Name <- gsub("\\\n", "", Name)
Name <- trimws(Name)
#length(unique(Name))
# Add the presidents who took office through succession
Name <- append(Name, "John Tyler", after = 14)
Name <- append(Name, "Millard Fillmore", after = 17)
Name <- append(Name, "Andrew Johnson", after = 22)
Name <- append(Name, "Chester A. Arthur", after = 27)
Name <- append(Name, "Theodore Roosevelt", after = 33)
Name <- append(Name, "Calvin Coolidge", after = 40)
Name <- append(Name, "Harry S. Truman", after = 46)
Name <- append(Name, "Lyndon B. Johnson", after = 51)
Name <- append(Name, "Gerald Ford", after = 55)
PresidentName <- Name
ElectionYear <- as.numeric(ElectionYear)
#ElectoralCollege
#PresidentName

# Age at Start of Presidency
url4 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_age"
webpage4 <- read_html(url4)
Age <- html_text(html_nodes(webpage4,'td:nth-child(4)'))
Age <- Age[1:45]
Age <- gsub("days.*", "", Age)
Age <- gsub("51 Years, 6", "51 Years, 06", Age)
Days <- as.numeric(substr(Age, 10, 13)) / 365
Year <- as.numeric(substr(Age, 1, 2))
NewYears <- round(Year + Days, 2)
# Second terms: age at first election plus four years
NewYears <- append(NewYears, 57.18+4, after = 1)
NewYears <- append(NewYears, 57.89+4, after = 4)
NewYears <- append(NewYears, 57.97+4, after = 6)
NewYears <- append(NewYears, 58.85+4, after = 8)
NewYears <- append(NewYears, 61.97+4, after = 11)
NewYears <- append(NewYears, 52.05+4, after = 21)
NewYears <- append(NewYears, 46.85+4, after = 24)
NewYears <- append(NewYears, 42.88+4, after = 33)
NewYears <- append(NewYears, 56.18+4, after = 36)
NewYears <- append(NewYears, 51.08+4, after = 39)
NewYears <- append(NewYears, 51.09+4, after = 42)
NewYears <- append(NewYears, 55.09+4, after = 43)
NewYears <- append(NewYears, 59.09+4, after = 44)
NewYears <- append(NewYears, 60.93+4, after = 46)
NewYears <- append(NewYears, 62.27+4, after = 48)
NewYears <- append(NewYears, 55.24+4, after = 51)
NewYears <- append(NewYears, 56.03+4, after = 53)
NewYears <- append(NewYears, 69.96+4, after = 57)
NewYears <- append(NewYears, 46.42+4, after = 60)
NewYears <- append(NewYears, 54.54+4, after = 62)
NewYears <- append(NewYears, 47.46+4, after = 64)
NewYears <- append(NewYears, 54.09+4, after = 32)
AgeatElection <- NewYears

#Dataframe Creation
Presidents <- data.frame(ElectionYear, ElectoralCollege, PresidentName, AgeatElection)
#summary(Presidents$AgeatElection)
```

```{r,echo=FALSE}
# Average Executive Orders by Year for Each President
url5 <- "https://en.wikipedia.org/wiki/List_of_United_States_federal_executive_orders"
webpage5 <- read_html(url5)
ExecOrder <- html_text(html_nodes(webpage5,'td:nth-child(7)'))
ExecOrder <- gsub("\\\n", "", ExecOrder)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  ExecOrder <- append(ExecOrder, NA, after = i)
}
Presidents$ExecOrder <- as.numeric(ExecOrder)
```

```{r,echo=FALSE}
#Inaugural Lengths
url6 <- "https://en.wikipedia.org/wiki/United_States_presidential_inauguration"
webpage6 <- read_html(url6)
Inaug <- html_text(html_nodes(webpage6,'td:nth-child(6)'))
Inaug <- gsub("w.*", "", Inaug)
Inaug <- gsub(".*\n", NA, Inaug)
Presidents$Inaug <- as.numeric(Inaug)
# Fill missing speech lengths with the median
Presidents$Inaug[is.na(Presidents$Inaug)] <- round(median(Presidents$Inaug, na.rm = TRUE))

#ReRan and ReWon
# NOTE: both vectors are cut off in the source; each should hold 67 values.
ReRan <- c(1,0,1,1,0,1,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,1,0,1,0,1,1,0,0,1,0,1,1,
ReWon <- c(1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,1,0,0,1,
#length(ReWon)
Presidents$ReRan <- as.factor(ReRan)
Presidents$ReWon <- as.factor(ReWon)

#Political Party
url7 <- "https://www.thoughtco.com/presidents-and-vice-presidentschart-4051729"
webpage7 <- read_html(url7)
Party <- html_text(html_nodes(webpage7,'td:nth-child(3)'))
Party <- gsub("Union", "Republican", Party)
Party <- append(Party, "No Party Afiiliated", after = 1)
Party <- append(Party, "Democratic-Republican", after = 4)
Party <- append(Party, "Democratic-Republican", after = 6)
Party <- append(Party, "Democratic-Republican", after = 8)
Party <- append(Party, "Democratic", after = 11)
Party <- append(Party, "Republican", after = 21)
Party <- append(Party, "Republican", after = 24)
Party <- append(Party, "Republican", after = 33)
Party <- append(Party, "Democratic", after = 36)
Party <- append(Party, "Republican", after = 39)
Party <- append(Party, "Democratic", after = 42)
Party <- append(Party, "Democratic", after = 43)
Party <- append(Party, "Democratic", after = 44)
Party <- append(Party, "Democratic", after = 46)
Party <- append(Party, "Republican", after = 48)
Party <- append(Party, "Democratic", after = 51)
Party <- append(Party, "Republican", after = 53)
Party <- append(Party, "Republican", after = 57)
Party <- append(Party, "Democratic", after = 60)
Party <- append(Party, "Republican", after = 62)
Party <- append(Party, "Democratic", after = 64)
Party <- append(Party, "Republican", after = 32)
Presidents$Party <- Party

#Number
Presidents$Number <- 1:67
#PresidentName
```

```{r,echo=FALSE}
#Education Levels
Educ <- 1:67
Educ[c(19,26,36,37,38,54,55,56,61,62,63,64,65,66)] <- "Graduate"
Educ[c(14,28,32,33,34,35,43,44,45,46,47,48,51,52,53,57)] <- "Attended Graduate"
Educ[c(3,4,5,6,7,8,9,10,15,16,20,24,25,27,30,39,40,41,42,49,50,58,59,60,67)] <- "Undergraduate"
Educ[c(1,2,11,12,13,17,18,29,31)] <- "High School"
Educ[c(21,22,23)] <- "No Formal"
Presidents$Educ <- Educ
#colnames(Presidents)
Presidents <- Presidents[, c(10, 1, 3, 2, 9, 4, 5, 6, 7, 8, 11)]
```

```{r,echo=FALSE}
# Type of Political Experience
url7 <- "https://www.nytimes.com/interactive/2019/05/23/us/politics/presidential-experience.html"
webpage7 <- read_html(url7)
PreviousExp <- html_text(html_nodes(webpage7,'.g-table-all_pres-td-3 p'))
PreviousExp <- append(PreviousExp, "President", after = 23)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  PreviousExp <- append(PreviousExp, "President", after = i)
}
PreviousExp <- gsub("Major general", "General", PreviousExp)
PreviousExp <- gsub("\\*", "", PreviousExp)
Presidents$PreviousExp <- PreviousExp
```

```{r, echo = FALSE}
#print("A small sample of the data cleaning I did on one variable. This was done 22 times to create the Presidents dataframe.")
# Type of Military Service
url8 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_military_service"
webpage8 <- read_html(url8)
Military <- html_text(html_nodes(webpage8,'td:nth-child(3)'))
Military <- gsub("Continental army", "Continental", Military)
Military <- gsub("United States Army", "Army", Military)
Military <- gsub("United States Naval", "Navy", Military)
Military <- gsub("Tennessee.*", "Army", Military)
Military <- gsub(".*Navy.*", "Navy", Military)
Military <- gsub(".*Continental.*", "militia", Military)
Military <- gsub(".*militia.*", "State Militia", Military)
Military <- gsub(".*Militia.*", "State Militia", Military)
Military <- gsub(".*None.*", "None", Military)
Military <- gsub(".*Army.*", "Army", Military)
Military <- gsub(".*National.*", "National Guard", Military)
Military <- gsub(".*Connecticut.*", "None", Military)
#table(Military)
Military <- rev(Military)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Military <- append(Military, NA, after = i)
}
Presidents$Military <- Military
```

```{r,echo=FALSE}
# Supreme Court Nominations
url9 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_judicial_appointments"
webpage9 <- read_html(url9)
SupremeC <- html_text(html_nodes(webpage9,'td:nth-child(2)'))
SupremeC <- gsub("\\[.*", "", SupremeC)
SupremeCourt <- SupremeC[-(45:58)]
SupremeCourt <- append(SupremeCourt, NA, after = 23)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  SupremeCourt <- append(SupremeCourt, NA, after = i)
}
Presidents$SupremeCourt <- as.numeric(SupremeCourt)

#Total Judicial Appointees
url10 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_judicial_appointments"
webpage10 <- read_html(url10)
TotalJ <- html_text(html_nodes(webpage10,'td:nth-child(5)'))
TotalJ <- gsub("\\[.*", "", TotalJ)
TotalJ <- TotalJ[-(45:58)]
TotalJ <- append(TotalJ, NA, after = 23)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  TotalJ <- append(TotalJ, NA, after = i)
}
Presidents$TotalJ <- as.numeric(TotalJ)
```

```{r,echo=FALSE}
#Vetoes
url11 <- "https://www.senate.gov/legislative/vetoes/vetoCounts.htm"
webpage11 <- read_html(url11)
Vetoes <- html_text(html_nodes(webpage11,'td:nth-child(5)'))
Vetoes <- trimws(Vetoes)
Vetoes <- Vetoes[-(46)]
Vetoes <- rev(Vetoes)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Vetoes <- append(Vetoes, NA, after = i)
}
Presidents$Vetoes <- as.numeric(Vetoes)

#Overridden Vetoes
Overridden <- html_text(html_nodes(webpage11,'tbody td:nth-child(6)'))
Overridden <- trimws(Overridden)
Overridden <- rev(Overridden)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Overridden <- append(Overridden, NA, after = i)
}
Presidents$Overridden <- as.numeric(Overridden)

#Successful Vetoes
Presidents$VetoSuccess <- Presidents$Vetoes - Presidents$Overridden
```

```{r,echo=FALSE}
#Height
url12 <- "https://en.wikipedia.org/wiki/Heights_of_presidents_and_presidential_candidates_of_the_United_States"
webpage12 <- read_html(url12)
Height <- html_text(html_nodes(webpage12,'td:nth-child(4)'))
Height <- Height[-c(1:45)]
Height <- gsub("c.*", "", Height)
Height <- rev(Height)
# Heights (cm) for the rows added through succession
Height <- append(Height, "183", after = 14)
Height <- append(Height, "175", after = 17)
Height <- append(Height, "178", after = 22)
Height <- append(Height, "188", after = 27)
Height <- append(Height, "178", after = 33)
Height <- append(Height, "178", after = 40)
Height <- append(Height, "175", after = 46)
Height <- append(Height, "192", after = 51)
Height <- append(Height, "183", after = 55)
Height <- append(Height, "191", after = 66)
# Convert centimeters to the "feet.inches" encoding described under Variables.
# Caveat: 10 or more inches carries into the feet digit
# (e.g. 178 cm, about 5 ft 10 in, becomes 5 + 1.0 = 6.0).
Height <- round(as.numeric(Height)/ 30.48,5)
In <- Height - floor(Height)
Height <- Height - In
In <- In * 12
In <- round(In)/10
Height <- Height + In
Presidents$Height <- Height
```

```{r,echo=FALSE, message=FALSE}
# Weight
url14 <- "https://www.potus.com/presidential-facts/presidential-weight/"
webpage14 <- read_html(url14)
Weight <- html_text(html_nodes(webpage14,' .column-4 , .column-3 , .column-2'))
Weight <- gsub("lbs", "", Weight)
Weight <- Weight[-c(1:3)]
# Group the scraped cells into rows of three, then sort by president number
test2 <- unlist(lapply(split(Weight, ceiling(seq_along(Weight)/3)), paste, collapse=", "))
Test2 <- test2[order(as.numeric(gsub("\\,.*", "", test2)))]
Test2 <- gsub(".*,", "", Test2)
#str(Test2)
Test2 <- unname(Test2)
Test2 <- append(Test2, "260", after = 21)
Test2 <- append(Test2, NA, after = 23)
Test2 <- Test2[-(46)]
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Test2 <- append(Test2, NA, after = i)
}
Presidents$Weight <- as.numeric(Test2)
```

```{r,echo=FALSE}
url15 <- "https://www.potus.com/presidential-facts/pardons-commutations/"
webpage15 <- read_html(url15)
Pardon <- html_text(html_nodes(webpage15,' .column-4 , .column-2'))
Pardon <- gsub("lbs", "", Pardon)
Pardon <- Pardon[-c(1:2)]
test3 <- unlist(lapply(split(Pardon, ceiling(seq_along(Pardon)/2)), paste, collapse=", "))
Test3 <- test3[order(as.numeric(gsub("\\,.*", "", test3)))]
Test3 <- gsub("^.*?\\,","", Test3)
Test3 <- gsub("\\*","", Test3)
Test3 <- gsub("11.*","11", Test3)
Test3 <- gsub(",","", Test3)
Test3 <- unname(Test3)
Test3 <- append(Test3, "1107", after = 21)
Test3 <- append(Test3, "NA", after = 23)
Test3 <- Test3[-(46)]
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Test3 <- append(Test3, NA, after = i)
}
Presidents$Pardon <- as.numeric(Test3)
```

```{r,echo=FALSE}
#Net Worth
url16 <- "https://www.potus.com/presidential-facts/net-worth/"
webpage16 <- read_html(url16)
NetWorth <- html_text(html_nodes(webpage16,' .column-4 , .column-2'))
NetWorth <- NetWorth[-c(1:2)]
Test4 <- unlist(lapply(split(NetWorth, ceiling(seq_along(NetWorth)/2)), paste, collapse=", "))
Test4 <- Test4[order(as.numeric(gsub("\\,.*", "", Test4)))]
Test4 <- gsub("^.*?\\,","", Test4)
Test4 <- gsub("4,500","2100", Test4)
Test4 <- gsub("<", "", Test4)
Test4 <- unname(Test4)
Test4 <- append(Test4, "28", after = 21)
Test4 <- append(Test4, "NA", after = 23)
Test4 <- Test4[-(46)]
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Test4 <- append(Test4, NA, after = i)
}
Presidents$NetWorth <- as.numeric(Test4)
```

## *Presidents Data Frame*

* Final Product of Web Scraping and Data Cleaning

```{r,echo=FALSE}
#Final Dataframe
Presidents <- Presidents[,c(1,2,3,5,4,6,8,7,14,15,16,17,18,11,12,13,21,19,20,22,9,10)]
dim(Presidents)
head(Presidents,3)
tail(Presidents,2)
```

*A small sample of the data cleaning I did on one variable. This was done 22 times to create the Presidents dataframe.*

```{r, warning = FALSE, message= FALSE}
# Type of Military Service
url8 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_military_service"
webpage8 <- read_html(url8)
Military <- html_text(html_nodes(webpage8,'td:nth-child(3)'))
Military <- gsub("Continental army", "Continental", Military)
Military <- gsub("United States Army", "Army", Military)
Military <- gsub("United States Naval", "Navy", Military)
Military <- gsub("Tennessee.*", "Army", Military)
Military <- gsub(".*Navy.*", "Navy", Military)
Military <- gsub(".*Continental.*", "militia", Military)
Military <- gsub(".*militia.*", "State Militia", Military)
Military <- gsub(".*Militia.*", "State Militia", Military)
Military <- gsub(".*None.*", "None", Military)
Military <- gsub(".*Army.*", "Army", Military)
Military <- gsub(".*National.*", "National Guard", Military)
Military <- gsub(".*Connecticut.*", "None", Military)
#table(Military)
Military <- rev(Military)
for ( i in c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
  Military <- append(Military, NA, after = i)
}
Presidents$Military <- Military
```

### *General Sense of the Data*

My first goal was to get a general sense of the data and to see whether there were any glaring correlations to attack first.

```{r, echo = FALSE,out.width = "54%"}
library(purrr)
library(tidyr)
library(ggplot2)
# Histogram of every numeric variable
Presidents %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
```

```{r, echo = FALSE,out.width = "54%"}
# Correlations among all numeric variables
Pres <- Presidents[, c(2,5,6,7,8,9,10,11,12,13,17,18,19,20)]
CorP <- cor(Pres, use = "complete.obs")
CorTest <- cor.mtest(Pres, conf.level = .95)
corrplot.mixed(CorP, lower.col="black", upper = "ellipse", tl.col = "black", number.cex=.7, tl.pos = "lt", tl.cex=.7, p.mat = CorTest$p, sig.level = .05)
par(mar = c(10,5,5,10))
q306 <- pairsJDRS(Pres)

# Correlations among a smaller subset of variables
Pres2 <- Presidents[, c(2,5,8,9,11,17,13)]
CorP <- cor(Pres2, use = "complete.obs")
CorTest <- cor.mtest(Pres2, conf.level = .95)
corrplot.mixed(CorP, lower.col="black", upper = "ellipse", tl.col = "black", number.cex=.7, tl.pos = "lt", tl.cex=.7, p.mat = CorTest$p, sig.level = .05)
par(mar = c(10,5,5,10))
q306 <- pairsJDRS(Pres2, cex.lab = 1)
```

* There seems to be a correlation between ElectoralCollege and SupremeCourt, so let's run some correlation tests, t-tests, and bootstraps to find out more.
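Because so many rows carry NAs, it helps to be clear about what `use = "complete.obs"` does in `cor()`. The sketch below uses made-up toy vectors (not the Presidents data) to show that only pairs where both values are present contribute to the correlation.

```r
# Toy vectors (hypothetical) with NAs in different positions.
x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, NA, 8, 10)

# use = "complete.obs" drops any pair where either value is missing.
r_complete <- cor(x, y, use = "complete.obs")

# The same result, filtering the complete pairs by hand.
keep <- complete.cases(x, y)
r_manual <- cor(x[keep], y[keep])

sum(keep)   # number of pairs actually used
```

With only three usable pairs out of five here, the effective sample size is smaller than the vector length, which is worth keeping in mind when reading the correlation plots.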
### *T-Test and Bootstrap*

```{r, echo = FALSE,out.width = "54%"}
# One Sample T-Test
mean(Presidents$ElectoralCollege, na.rm = T)
round(t.test(Presidents$ElectoralCollege)$conf.int, 1)
(t.test1 <- t.test(Presidents$ElectoralCollege))
#str(t.test1)
#names(t.test1)
s <- sample(Presidents$ElectoralCollege, 67, replace = TRUE)
M <- mean(s, na.rm = T)
MES <- mean(Presidents$ElectoralCollege, na.rm = T)

# Bootstrap of Electoral College Mean
x <- Presidents$ElectoralCollege
n_samp <- 20
n <- length(x)
means <- rep(NA, n_samp)
urad <- .5
plot(range(x, na.rm = T), c(0, (n_samp + 4)), type = "n", ylab = "", main = "Bootstrapped Electoral College Percents", yaxt = "n", xlab = "Percent")
# Actual data along the bottom, with its mean as a red star
points(x, rep(0,n), cex = urad, pch = 19, col = "black")
points(mean(x, na.rm = T), 0, pch = 8, col = "red", cex = 3*urad)
text(10, 1.5, "Actual Data")
# One row per bootstrap sample; point size reflects how often a value was drawn
for(i in 5:(n_samp + 4)){
  s <- sample(x, n, replace = T)
  means[i] <- mean(s, na.rm = T)
  sVals <- as.numeric(names(table(s)))
  sCounts <- as.vector(table(s))
  points(sVals, rep(i,length(sVals)), cex = sqrt(sCounts)*urad, pch = 19, col = "blue")
}
for(i in 5:(n_samp + 4)){
  points(means[i], i, pch = 8, col = "red", cex = 2*urad)
}

# 10,000 bootstrap means
n <- length(Presidents$ElectoralCollege)
n_samp <- 10000
bmeans <- rep(NA, n_samp)
for(i in 1:n_samp){
  s <- sample(Presidents$ElectoralCollege, n, replace=T)
  bmeans[i] <- mean(s, na.rm = T)
}
hist(bmeans, col = "blue", main = "Bootstrapped Electoral College Percents", xlab = "Percent", breaks = 50)
qqPlot(bmeans, col = "red", pch=19, main = "qqPlot of Sample Means")
# Percentile bootstrap confidence interval
ci <- quantile(bmeans, c(.025, .975))
round(ci,3)
ABC <- round(t.test(Presidents$ElectoralCollege)$conf.int, 1)
hist(bmeans, col = "blue", main = "Bootstrapped Sample Mean of Electoral College Percent", xlab = "Percent", breaks = 50)
abline(v = ci, lwd = 3, col = "red")
abline(v = t.test(Presidents$ElectoralCollege)$conf.int, lwd = 3, col = "green", lty = 2)
legend("topright", c("Original CI", "Boot CI"), lwd = 3, col = c("green","red"), lty = c(2,1))
```

Our one-sample t-test gave a 95% confidence interval for the true mean Electoral College share of 66.7% to 75%. We then ran bootstraps to estimate the mean; the scatterplot shows each resample's mean in bright red, and the histograms show the spread of those means. Ultimately, though, our bootstrap confidence interval was only slightly more accurate. From a practical standpoint, the fact that only a majority (just over 50%) of the Electoral College is needed to win the presidency, while the average win is about 70%, shows that most presidential elections have not been as neck and neck as in recent years.

```{r,echo = FALSE, results = 'hide',out.width = "54%"}
# Bootstrap the regression of column 2 on column 1; returns one coefficient vector per replicate
bootplot <- function(data, nrep, plotit = F){
  N <- nrow(data)
  slopes <- as.list(rep(NA, nrep))
  for(i in 1:nrep){
    s <- sample(1:N, N , replace = T)
    sVals <- as.numeric(names(table(s)))
    sCounts <- as.vector(table(s))
    fakeData <- data[s, ]
    cor1 <- cor(fakeData[, 1], fakeData[,2])
    lm1 <- lm(fakeData[, 2] ~ fakeData[, 1])
    slopes[[i]] <- lm1$coefficients
    if (plotit == T) {
      plot(data, type = "n")
      points(data[sVals, 1], data[sVals, 2], cex = sqrt(sCounts), pch = 19, col = "red")
      abline(lm1$coef, col = "blue", lwd = 3)
      title(main = paste("cor = ",round(cor1, 2),", slope = ", round(lm1$coef[2], 2), "Rep = ", i))
      Sys.sleep(.5)
    }
  }
  return(slopes)
}
```

### *Bootstrap of Electoral College and Supreme Court Correlation*

```{r, echo = FALSE,out.width = "54%"}
cor1 <- cor(Presidents$SupremeCourt, Presidents$ElectoralCollege, use = "complete.obs")
cor1
cor.test(Presidents$SupremeCourt, Presidents$ElectoralCollege, use = "complete.obs")
lm1 <- lm(Presidents$ElectoralCollege ~ Presidents$SupremeCourt)
summary(lm1)
names(lm1)
plot( Presidents$SupremeCourt, Presidents$ElectoralCollege,main = paste("r = ", round(cor1,2) , ", slope = ", round(lm1$coef[2],2)), pch = 19, col = "red")
abline(lm1$coef, col = "blue", lwd = 3)

Max <- cbind(Presidents$SupremeCourt, Presidents$ElectoralCollege)
N <- nrow(Max)
n_samp <- 10000
corResults <- rep(NA, n_samp)
bResults <- rep(NA, n_samp)
# Resample rows with replacement and record the correlation and slope each time
for(i in 1:n_samp){
  s <- sample(1:N, N , replace = T)
  sVals <- as.numeric(names(table(s)))
  sCounts <- as.vector(table(s))
  fakeSC <- rep(Presidents$SupremeCourt[sVals], sCounts)
  fakeEC <- rep(Presidents$ElectoralCollege[sVals], sCounts)
  cor1 <- cor(fakeSC, fakeEC, use = "complete.obs")
  lm1 <- lm(fakeEC ~ fakeSC)
  corResults[i] <- cor1
  bResults[i] <- lm1$coef[2]
}
ci_r <- quantile(corResults, c(.025, .975), na.rm = T)
ci_slope <- quantile(bResults, c(.025, .975), na.rm = T)
hist(corResults, col = "blue", main = "Bootstrapped Correlations", xlab = "Sample Correlation", breaks = 50)
abline(v = ci_r, lwd = 3, col = "red")
abline(v = cor.test(Presidents$SupremeCourt, Presidents$ElectoralCollege)$conf.int, lwd = 3, col = "green", lty = 2)
legend(-0.3, 450, c("Theoretical CI","Boot CI"), lwd = 3, col = c("green","red"), lty = c(2,1))
qqPlot(corResults, main = "qqPlot of corResults")

#Bootstrap Line Graphs
TTest <- t.test(Presidents$ElectoralCollege, Presidents$SupremeCourt)
(TestCor <- cor(Presidents$ElectoralCollege, Presidents$SupremeCourt, use = "complete.obs"))
qqPlot(Presidents$ElectoralCollege, col = 2, pch = 19)
cor1 <- cor(Presidents$SupremeCourt, Presidents$ElectoralCollege, use = "complete.obs")
bootlines <- bootplot(Max, 100, plotit = F)
lm1 <- lm(Presidents$ElectoralCollege ~ Presidents$SupremeCourt)
plot(Max, main = paste("Electoral College and Supreme Court Data, r = ", round(cor1,2) , ", slope = ", round(lm1$coef[2],2)), pch = 19, col = "red", xlab = " Number of Supreme Court Confirmations", ylab = " Electoral College Percent")
abline(lm(Presidents$ElectoralCollege ~ Presidents$SupremeCourt), col = "blue", lwd = 3)
# Overlay the first 50 bootstrapped regression lines
for (i in 1:50){
  abline(bootlines[[i]], col = "green", lty = 2)
}
```

We explored the relationship between Supreme Court confirmations and Electoral College percentages and found a statistically significant positive relationship between the two.
We determined from our bootstrap that the true correlation is around 0.44, so an increase in Supreme Court confirmations is correlated with an increase in election performance. This makes sense, because judicial appointments are an important consideration for voters when they evaluate a candidate.

### *Permutation*

```{r, echo = FALSE, out.width = "54%"}
# One shuffled copy of the Electoral College values, to show what "no relationship" looks like
fakeEC <- sample(Presidents$ElectoralCollege)
plot(Presidents$SupremeCourt, fakeEC, pch=19, col="red", main="")
mtext("Fake Relationship between Supreme Court and Electoral College", cex=1.2, line=1)
mtext(paste("Correlation =", round(cor(Presidents$SupremeCourt, fakeEC, use = "complete.obs"),2)), line=0, cex=1)

# Permutation distribution of the correlation
n_samp <- 10000
corResults <- rep(NA, n_samp)
for(i in 1:n_samp){
  corResults[i] <- cor(Presidents$SupremeCourt, sample(Presidents$ElectoralCollege), use = "complete.obs")
}
# Two-sided permutation p-value
(truecor <- mean(abs(corResults) >= abs(cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use = "complete.obs"))))
hist(corResults, col = "yellow", main = "", xlab = "Correlations", breaks = 50)
mtext("Permuted Sample Correlations", cex = 1.2, line = 1)
mtext(paste("Permuted P-value =",round(truecor,4)), cex = 1, line = 0)
abline(v = cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use = "complete.obs"), col="blue", lwd=3)
text(cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use = "complete.obs")-.02, 200 ,paste("Actual Correlation =", round(cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use = "complete.obs"),2)),srt = 90)
```

We completed a permutation test to check whether the correlation we found in our bootstrap depends on how the two variables are paired. We shuffled the Electoral College values, breaking their pairing with the Supreme Court counts, and recomputed the correlation 10,000 times. A correlation as large as .44 was highly unlikely to arise by chance under shuffling, so it is statistically significant, which validates our previous evaluation. There is a statistically significant correlation between Supreme Court confirmations and election results.
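The logic of the permutation test can be seen in isolation on simulated data. The following is a minimal sketch under stated assumptions: the toy vectors `x` and `y`, the seed, and the 2,000 shuffles are all made up for illustration and have nothing to do with the Presidents data.

```r
# Toy data (hypothetical): y depends on x plus noise.
set.seed(230)
x <- 1:20
y <- x + rnorm(20, sd = 3)

obs_cor <- cor(x, y)

# Shuffle y to break its pairing with x, and recompute the correlation each time.
perm_cors <- replicate(2000, cor(x, sample(y)))

# Two-sided p-value: share of shuffles at least as extreme as the observed value.
p_value <- mean(abs(perm_cors) >= abs(obs_cor))
p_value
```

Shuffling only one of the two variables is enough, since what matters is destroying the pairing; the marginal distribution of each variable stays exactly the same.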
### *Multiple Regression*

We were still interested in exploring the Electoral College variable further, so we attempted to create a predictive model. We did not use variables that had NAs, and we used the stepAIC function to get a backwards stepwise regression. We also looked at some individual variables to test their strength.

```{r, echo = FALSE,out.width = "54%", include = FALSE}
#names(Presidents)
#str(Presidents)
library(MASS)
full.model <- lm(ElectoralCollege ~ AgeatElection + Party + ElectionYear + Inaug + ExecOrder + SupremeCourt + TotalJ + Vetoes + Overridden + Educ + PreviousExp + Military + Pardon + Height + Weight + NetWorth + ReWon, data = Presidents)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", trace = T)
print("Final Output of stepAIC backwards regressions")
#summary(step.model)
```

```{r, echo = FALSE,out.width = "54%", return = 'hide'}
myResPlots2(step.model)
# Single-predictor models
(Mod1 <- lm(ElectoralCollege ~ SupremeCourt, data = Presidents))
#summary(Mod1)
(Mod1 <- lm(ElectoralCollege ~ ExecOrder, data = Presidents))
#summary(Mod1)
```

Here our various regression models were trying to predict Electoral College success. By using the stepAIC backwards regression function, we were left with a regression equation with about four fewer terms than we started with, and a lower AIC. Our final model suggested that previous military service, winning reelection, number of vetoes, and independent status may have some statistically significant effects on election winning. (However, we had an r-squared of .87, which is not as effective as it could be. Almost all of our variables were not statistically significant, either because of the size of the data set, the content of the variables, or the fact that presidential electoral results aren't entirely predictable.) We would definitely like to explore some more complex models concerning this.
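For readers unfamiliar with `stepAIC`, the following minimal sketch shows the same kind of backward selection on R's built-in `mtcars` data rather than the Presidents data. The choice of `mpg` as the response and the object names `full_fit` and `step_fit` are assumptions made for illustration only.

```r
library(MASS)

# Illustrative data: the built-in mtcars set, not the Presidents data frame.
full_fit <- lm(mpg ~ ., data = mtcars)

# Backward selection: drop one term at a time while doing so lowers the AIC.
step_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

# Compare model size and AIC before and after selection.
c(full = length(coef(full_fit)), stepped = length(coef(step_fit)))
c(full = AIC(full_fit), stepped = AIC(step_fit))
```

Selection of this kind can only keep the AIC the same or lower it, so the stepped model's AIC is never above the full model's. Note also that `lm()` silently drops rows with NAs, so a full model with many sparse predictors may be fit on far fewer rows than the data frame contains.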
As an alternative, we also took a look at some one-off variables to see which single variable could best describe Electoral College percent by itself. The result was Supreme Court confirmations (with an r-squared of .1965), followed by executive orders (r-squared = .11). Our final model shows residuals that are normally distributed and do not show signs of heteroskedasticity. Our data set seems to have a good distribution, but its naturally small size (only 45 presidents) is probably responsible for our low r-squared values.

## *Fun Facts*

```{r, echo = FALSE,out.width = "54%"}
hist(AgeatElection, col = "yellow", main = "Histogram of Presidential Ages at Election")
abline(v = mean(AgeatElection), col = "red", lwd = 3, las = 2)
```

## *Out of presidents running for reelection, 70% have won, while 30% have lost.*

### Tallest Presidents: Abraham Lincoln, Lyndon B. Johnson, Donald Trump, George Washington, and Thomas Jefferson

```{r, echo = FALSE}
abcd <- head(Presidents[order(-Presidents$Height),],8)
```

### Shortest Presidents: James Madison, Benjamin Harrison, Martin Van Buren, William McKinley, and John Quincy Adams

```{r, echo = FALSE,out.width = "54%"}
abcde <- tail(Presidents[order(-Presidents$Height),],7)
boxplot(AgeatElection ~ Educ, col = "red", main = "Age at Election by Education", xlab = "Type of Education", ylab = "Age", las = 2, cex.axis = .6)
```

*There seem to be no large discrepancies in age by education type, though this should be looked at more closely!*

## Proportional Pie Chart of Education of Presidents

```{r, echo = FALSE,out.width = "54%"}
library(plotrix)
pie3D(table(Presidents$Educ), theta=pi/4, explode=0.1, labels=names(table(Presidents$Educ)), labelcex = 1.04)
```

## Proportional Pie Chart of Previous Experience of Presidents

```{r, echo = FALSE,out.width = "54%"}
library(plotrix)
pie3D(table(Presidents$PreviousExp), theta=pi/4, explode=0.1, labels=names(table(Presidents$PreviousExp)), labelcex = 1.04)
```

## Proportional Pie Chart of Party of Presidents

```{r, echo = FALSE,out.width = "54%"}
library(plotrix)
pie3D(table(Presidents$Party), theta=pi/4, explode=0.1, labels=names(table(Presidents$Party)), labelcex = 1.04)
```

# *Conclusion*

This has been only a very brief look into some of the possibilities opened up by the Presidents data set. To summarize, I compiled various data points on all American presidents by web scraping from several sources to create a new data frame. We looked at the correlations and distributions of the variables in our data frame and chose several relationships to examine more closely. We looked at the relationship between Electoral College percentage and Supreme Court confirmations and found a statistically significant positive relationship between them. Then we tried various regression models to predict election success. By using the stepAIC backwards regression function, we were left with a regression equation that suggested that previous military service, winning reelection, number of vetoes, and independent status may have some effects on election winning. We would definitely like to explore some more complex models concerning this. We also took a look at some one-off variables to see which single variable could best describe Electoral College percent, and the result was Supreme Court confirmations, followed by executive orders.
Finally, we looked at some fun facts about the presidents' ages, heights, and experiences.

*There is much more to discover and learn. This data set will continue to be updated and to grow.*

![](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/The_Seal_of_the_President_of_the_United_States.jpg/220pxThe_Seal_of_the_President_of_the_United_States.jpg)