Uploaded by Marcos Barrios

SDS 230 - Final Code

---
title: "Stats 230 Final Project"
author: ''
date: "4/19/2020"
output:
  pdf_document: default
  html_document: default
out.height: 20%
out.width: 20%
fig_width: 2
fig_height: 1
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE)
```
```{r setup, echo = F, warning = FALSE, message = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(car)
library(leaps)
library(lubridate)
library(rvest)
library(olsrr)
library(corrplot)
source("http://www.reuningscherer.net/s&ds230/Rfuncs/regJDRS.txt")
```
# *Presidential Data Analysis*
![](https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/gettyimages-515392598-master-1503595848.jpg)
### *Introduction*
During the 2020 election year, it seems appropriate to take a look at
presidential data, from George Washington to Donald Trump, and to search
for patterns over time that could help us understand what makes a
president. A wide range of variables, from physical attributes to policy
effectiveness to political and educational experience, is examined for
significant correlations. One of the main goals of this project was to
create a readable data set covering all of the presidents, which does not
exist as of now.
### *Data*
The data frame has 67 rows and 22 variables. While there have been 44
presidents, there have been 58 elections and 67 transitions of power.
Because of death, assassination, and resignation, presidents also come and
go in non-election years. To handle this uneven structure, I found it
simpler to count each transition of power as a separate row. For example,
George Washington's first term is the first row and his second term is the
second. Abraham Lincoln takes up two rows because of his two elections, and
the succession of Andrew Johnson after Lincoln's assassination gives
Johnson his own row.
### *Variables*
* ElectoralCollege: Proportion of electoral votes won by the president in
the election.
* AgeatElection: The age of the president at election, in years.
* Inaug: The number of words in the president's inaugural address.
* ExecOrder: Average number of executive orders per year in office. NAs
exist because the value is recorded only once per president. For example,
George Washington takes up two rows. Over the eight years of his
presidency, he averaged 1 executive order a year, and this is recorded only
once.
* SupremeCourt: Confirmed Supreme Court nominations over the course of the
presidency.
* TotalJ: Total confirmed judicial appointees over the course of the
presidency.
* VetoSuccess: Total number of vetoes not overridden by Congress over the
course of the presidency.
* Educ: Highest level of education achieved.
* PreviousExp: Highest level of political experience achieved before the
presidency.
* Height: In feet and inches. 6.2 reads as 6 feet 2 inches.
* NetWorth: Net worth in millions. A value of 1 represents less than 1
million.
* ReRan: 0 = did not run for reelection, 1 = ran for reelection.
* ReWon: 0 = did not win reelection, 1 = won reelection.
### *Data Cleaning Process*
Every variable in the Presidents data set was web scraped and cleaned
according to its unique characteristics. I used sources such as
senate.gov, potus.com, and various Wikipedia pages to gather all of these
data points into one data frame. While each variable had a different
process, I used for loops with append to pad most of the raw data so that
it fit my 67 rows. If you search for presidential data, you will get 44 or
45 results for any given variable, but I had to split this data between
terms and introduce incidental presidencies at specific points. I
encountered many issues in aligning my NAs so that they corresponded
correctly with gaps in presidential terms and with the reelections of
presidents. I also ran into trouble web scraping from so many different
sources, so I had to use my selector tool sparingly and carefully. Some of
the functions I used to prepare this data frame were gsub, for loops,
append, trimws, as.numeric, round, rev, and unname.
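The padding step described above can be sketched on toy data. This is a minimal illustration, assuming a hypothetical four-element vector and made-up insertion points; only `append()` and its `after` argument are taken from the analysis itself. Each `after` index refers to the vector as it stands after the earlier insertions, which is why the order of the positions matters.

```r
# Hypothetical toy values, one per president (not real figures)
per_president <- c(10, 20, 30, 40)

# Hypothetical positions after which a placeholder row is needed
insert_after <- c(1, 4)

aligned <- per_president
for (i in insert_after) {
  # Insert an NA after position i of the vector as it currently stands
  aligned <- append(aligned, NA, after = i)
}

aligned
```

Because the vector grows on every pass, a position list written against the final 67-row layout has to be applied in the order the rows should appear.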
```{r,echo=FALSE}
#Election Year Variable
url1 <- "https://www.infoplease.com/us/government/elections/presidential-elections-1789-2016"
webpage1 <- read_html(url1)
ElectionYear <- html_text(html_nodes(webpage1,'.sgmltable b'))
#(ElectionYear)
for ( i in c(14,17,22,27,33,39,46,51,55)) {
ElectionYear <- append(ElectionYear, NA, after = i)
}
ElectionYear <- ElectionYear
#Electoral College Percentage Win
url2 <- "https://en.wikipedia.org/wiki/United_States_presidential_election"
webpage2 <- read_html(url2)
ElectionR <- html_text(html_nodes(webpage2,'tr:nth-child(18) .nowrap ,
tr:nth-child(13) .nowrap , td:nth-child(7) .nowrap'))
ElectionR <-gsub("/", "", ElectionR)
ElectionR <- gsub("\\s+", " ", ElectionR)
ElectionR <- gsub("69 138", "069 69", ElectionR)
ElectionR <- gsub("1 138", "132 132", ElectionR)
ElectionR <- gsub("1 264", "071 138", ElectionR)
ElectionR <- gsub("73 276", "073 138", ElectionR)
ElectionR <- gsub("162 176", "162 176", ElectionR)
ElectionR <- gsub("113 176", "122 175", ElectionR)
ElectionR <- gsub("128 217", "128 217", ElectionR)
ElectionR <- gsub("183 217", "183 217", ElectionR)
ElectionR <- gsub("73 276", "231 232", ElectionR)
ElectionR <- gsub("218 232", "084 261", ElectionR)
ElectionR <- gsub("74 261", "178 261", ElectionR)
ElectionR <- gsub("171 261", "219 286", ElectionR)
ElectionR <- gsub("189 286", "219 286", ElectionR)
ElectionR <- gsub("147 294", "170 294", ElectionR)
Col1 <- substr(ElectionR, 1, 3)
Col2 <- substr(ElectionR, 5, 7)
Col1 <- as.numeric(Col1)
Col2 <- as.numeric(Col2)
ElectoralCollege <- round(Col1/Col2, 2)
for ( i in c(14,17,22,27,33,39,46,51,55)) {
ElectoralCollege <- append(ElectoralCollege, NA, after = i)
}
ElectoralCollege <- ElectoralCollege
#President Name by Election Year
url3 <- "https://en.wikipedia.org/wiki/United_States_presidential_election"
webpage3 <- read_html(url3)
Name <- html_text(html_nodes(webpage3,'.vcard:nth-child(2) .fn'))
Name <- gsub("/.*", "", Name)
Name <- gsub("(incumbent)", "", Name)
Name <- gsub("\\()", "", Name)
Name <- gsub("\\\n", "", Name)
Name <- trimws(Name)
#length(unique(Name))
Name <- append(Name, "John Tyler", after = 14)
Name <- append(Name, "Millard Fillmore", after = 17)
Name <- append(Name, "Andrew Johnson", after = 22)
Name <- append(Name, "Chester A. Arthur", after = 27)
Name <- append(Name, "Theodore Roosevelt", after = 33)
Name <- append(Name, "Calvin Coolidge", after = 40)
Name <- append(Name, "Harry S. Truman", after = 46)
Name <- append(Name, "Lyndon B. Johnson", after = 51)
Name <- append(Name, "Gerald Ford", after = 55)
PresidentName <- Name
ElectionYear <- as.numeric(ElectionYear)
#ElectoralCollege
#PresidentName
# Age at Start of Presidency
url4 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_age"
webpage4 <- read_html(url4)
Age <- html_text(html_nodes(webpage4,'td:nth-child(4)'))
Age <- Age[1:45]
Age <- gsub("days.*", "", Age)
Age <- gsub("51 Years, 6", "51 Years, 06", Age)
Days <- as.numeric(substr(Age, 10, 13)) / 365
Year <- as.numeric(substr(Age, 1, 2))
NewYears <- round(Year + Days, 2)
NewYears <- append(NewYears, 57.18+4, after = 1)
NewYears <- append(NewYears, 57.89+4, after = 4)
NewYears <- append(NewYears, 57.97+4, after = 6)
NewYears <- append(NewYears, 58.85+4, after = 8)
NewYears <- append(NewYears, 61.97+4, after = 11)
NewYears <- append(NewYears, 52.05+4, after = 21)
NewYears <- append(NewYears, 46.85+4, after = 24)
NewYears <- append(NewYears, 42.88+4, after = 33)
NewYears <- append(NewYears, 56.18+4, after = 36)
NewYears <- append(NewYears, 51.08+4, after = 39)
NewYears <- append(NewYears, 51.09+4, after = 42)
NewYears <- append(NewYears, 55.09+4, after = 43)
NewYears <- append(NewYears, 59.09+4, after = 44)
NewYears <- append(NewYears, 60.93+4, after = 46)
NewYears <- append(NewYears, 62.27+4, after = 48)
NewYears <- append(NewYears, 55.24+4, after = 51)
NewYears <- append(NewYears, 56.03+4, after = 53)
NewYears <- append(NewYears, 69.96+4, after = 57)
NewYears <- append(NewYears, 46.42+4, after = 60)
NewYears <- append(NewYears, 54.54+4, after = 62)
NewYears <- append(NewYears, 47.46+4, after = 64)
NewYears <- append(NewYears, 54.09+4, after = 32)
AgeatElection <- NewYears
#Dataframe Creation
Presidents <- data.frame(ElectionYear, ElectoralCollege, PresidentName,
AgeatElection)
#summary(Presidents$AgeatElection)
```
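The chunk above turns scraped "votes won, total votes" strings into a proportion using fixed-width `substr()` calls, which is why several values are zero-padded first. As an alternative sketch (not the method used in this project), the same proportion can be computed by splitting on whitespace, so no padding is needed. The input strings below are hypothetical examples in the same "won total" shape.

```r
# Hypothetical "won total" strings, in the same shape as the cleaned ElectionR
raw <- c("69 138", "132 132", "304 538")

# Split each string on runs of whitespace instead of fixed character positions
parts <- strsplit(trimws(raw), "\\s+")
won   <- as.numeric(sapply(parts, `[`, 1))
total <- as.numeric(sapply(parts, `[`, 2))

# Proportion of electoral votes won, rounded as in the chunk above
share <- round(won / total, 2)
share
```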
```{r,echo=FALSE}
# Average Executive Orders by Year for Each President
url5 <- "https://en.wikipedia.org/wiki/List_of_United_States_federal_executive_orders"
webpage5 <- read_html(url5)
ExecOrder <- html_text(html_nodes(webpage5,'td:nth-child(7)'))
ExecOrder <- gsub("\\\n", "", ExecOrder)
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
ExecOrder <- append(ExecOrder, NA, after = i)
}
Presidents$ExecOrder <- as.numeric(ExecOrder)
```
```{r,echo=FALSE}
#Inaugural Lengths
url6 <- "https://en.wikipedia.org/wiki/United_States_presidential_inauguration"
webpage6 <- read_html(url6)
Inaug <- html_text(html_nodes(webpage6,'td:nth-child(6)'))
Inaug <- gsub("w.*", "", Inaug)
Inaug <- gsub(".*\n", NA, Inaug)
Presidents$Inaug <- as.numeric(Inaug)
Presidents$Inaug[is.na(Presidents$Inaug)] <- round(median(Presidents$Inaug,
na.rm = TRUE))
#ReRan and ReWon
ReRan <- c(1,0,1,1,0,1,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,1,0,1,0,1,1,0,0,1,0,1,1,
ReWon <- c(1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,1,0,0,1,
#length(ReWon)
Presidents$ReRan <- as.factor(ReRan)
Presidents$ReWon <- as.factor(ReWon)
#Political Party
url7 <- "https://www.thoughtco.com/presidents-and-vice-presidents-chart-4051729"
webpage7 <- read_html(url7)
Party <- html_text(html_nodes(webpage7,'td:nth-child(3)'))
Party <- gsub("Union", "Republican", Party)
Party <- append(Party, "No Party Affiliated", after = 1)
Party <- append(Party, "Democratic-Republican", after = 4)
Party <- append(Party, "Democratic-Republican", after = 6)
Party <- append(Party, "Democratic-Republican", after = 8)
Party <- append(Party, "Democratic", after = 11)
Party <- append(Party, "Republican", after = 21)
Party <- append(Party, "Republican", after = 24)
Party <- append(Party, "Republican", after = 33)
Party <- append(Party, "Democratic", after = 36)
Party <- append(Party, "Republican", after = 39)
Party <- append(Party, "Democratic", after = 42)
Party <- append(Party, "Democratic", after = 43)
Party <- append(Party, "Democratic", after = 44)
Party <- append(Party, "Democratic", after = 46)
Party <- append(Party, "Republican", after = 48)
Party <- append(Party, "Democratic", after = 51)
Party <- append(Party, "Republican", after = 53)
Party <- append(Party, "Republican", after = 57)
Party <- append(Party, "Democratic", after = 60)
Party <- append(Party, "Republican", after = 62)
Party <- append(Party, "Democratic", after = 64)
Party <- append(Party, "Republican", after = 32)
Presidents$Party <- Party
#Number
Presidents$Number <- 1:67
#PresidentName
```
```{r,echo=FALSE}
#Education Levels
Educ <- 1:67
Educ[c(19,26,36,37,38,54,55,56,61,62,63,64,65,66)] <- "Graduate"
Educ[c(14,28,32,33,34,35,43,44,45,46,47,48,51,52,53,57)] <- "Attended Graduate"
Educ[c(3,4,5,6,7,8,9,10,15,16,
20,24,25,27,30,39,40,41,42,49,50,58,59,60,67)] <- "Undergraduate"
Educ[c(1,2,11,12,13, 17, 18,29,31)] <- "High School"
Educ[c(21,22,23)] <- "No Formal"
Presidents$Educ<- Educ
#colnames(Presidents)
Presidents<- Presidents[, c(10, 1, 3, 2, 9, 4, 5, 6, 7, 8, 11)]
```
```{r,echo=FALSE}
# Type of Political Experience
url7 <- "https://www.nytimes.com/interactive/2019/05/23/us/politics/presidential-experience.html"
webpage7 <- read_html(url7)
PreviousExp <- html_text(html_nodes(webpage7,'.g-table-all_pres-td-3 p'))
PreviousExp <- append(PreviousExp, "President", after = 23)
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
PreviousExp <- append(PreviousExp, "President", after = i)
}
PreviousExp <- gsub("Major general", "General", PreviousExp)
PreviousExp <- gsub("\\*", "", PreviousExp)
Presidents$PreviousExp <- PreviousExp
```
```{r, echo = FALSE}
#print("A small sample of the data cleaning I did on one variable. This was done 22 times to create the Presidents dataframe.")
# Type of Military Service
url8 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_military_service"
webpage8 <- read_html(url8)
Military <- html_text(html_nodes(webpage8,'td:nth-child(3)'))
Military <- gsub("Continental army", "Continental", Military)
Military <- gsub("United States Army", "Army", Military)
Military <- gsub("United States Naval", "Navy", Military)
Military <- gsub("Tennessee.*", "Army", Military)
Military <- gsub(".*Navy.*", "Navy", Military)
Military <- gsub(".*Continental.*", "militia", Military)
Military <- gsub(".*militia.*", "State Militia", Military)
Military <- gsub(".*Militia.*", "State Militia", Military)
Military <- gsub(".*None.*", "None", Military)
Military <- gsub(".*Army.*", "Army", Military)
Military <- gsub(".*National.*", "National Guard", Military)
Military <- gsub(".*Connecticut.*", "None", Military)
#table(Military)
Military <- rev(Military)
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Military <- append(Military, NA, after = i)
}
Presidents$Military <-Military
```
```{r,echo=FALSE}
# Supreme Court Nominations
url9 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_judicial_appointments"
webpage9 <- read_html(url9)
SupremeC <- html_text(html_nodes(webpage9,'td:nth-child(2)'))
SupremeC <- gsub("\\[.*", "", SupremeC)
SupremeCourt <- SupremeC[-(45:58)]
SupremeCourt <- append(SupremeCourt, NA, after = 23 )
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
SupremeCourt <- append(SupremeCourt, NA, after = i)
}
Presidents$SupremeCourt <- as.numeric(SupremeCourt)
#Total Judicial Appointees
url10 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_judicial_appointments"
webpage10 <- read_html(url10)
TotalJ <- html_text(html_nodes(webpage10,'td:nth-child(5)'))
TotalJ <- gsub("\\[.*", "", TotalJ)
TotalJ <- TotalJ[-(45:58)]
TotalJ <- append(TotalJ, NA, after = 23 )
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
TotalJ <- append(TotalJ, NA, after = i)
}
Presidents$TotalJ <- as.numeric(TotalJ)
```
```{r,echo=FALSE}
#Vetoes
url11 <- "https://www.senate.gov/legislative/vetoes/vetoCounts.htm"
webpage11 <- read_html(url11)
Vetoes <- html_text(html_nodes(webpage11,'td:nth-child(5)'))
Vetoes <- trimws(Vetoes)
Vetoes <- Vetoes[-(46)]
Vetoes <- rev(Vetoes)
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Vetoes <- append(Vetoes, NA, after = i)
}
Presidents$Vetoes <- as.numeric(Vetoes)
#Overridden Vetoes
Overridden <- html_text(html_nodes(webpage11,'tbody td:nth-child(6)'))
Overridden <- trimws(Overridden)
Overridden <- rev(Overridden)
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Overridden <- append(Overridden, NA, after = i)
}
Presidents$Overridden <- as.numeric(Overridden)
#Successful Vetoes
Presidents$VetoSuccess <- Presidents$Vetoes - Presidents$Overridden
```
```{r,echo=FALSE}
#Height
url12 <- "https://en.wikipedia.org/wiki/Heights_of_presidents_and_presidential_candidates_of_the_United_States"
webpage12 <- read_html(url12)
Height <- html_text(html_nodes(webpage12,'td:nth-child(4)'))
Height <- Height[-c(1:45)]
Height <- gsub("c.*", "", Height)
Height <- rev(Height)
Height <- append(Height, "183", after = 14)
Height <- append(Height, "175", after = 17)
Height <- append(Height, "178", after = 22)
Height <- append(Height, "188", after = 27)
Height <- append(Height, "178", after = 33)
Height <- append(Height, "178", after = 40)
Height <- append(Height, "175", after = 46)
Height <- append(Height, "192", after = 51)
Height <- append(Height, "183", after = 55)
Height <- append(Height, "191", after = 66)
Height <- round(as.numeric(Height)/ 30.48,5)
In <- Height - floor(Height)
Height <- Height - In
In <- In * 12
In <- round(In)/10
Height <- Height + In
Presidents$Height <- Height
```
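The chunk above stores height as feet plus inches divided by 10, so 6.2 means 6 feet 2 inches. One caveat, if I trace the arithmetic correctly, is that 10 or 11 inches carries over into the feet digit, so 180 cm (5 feet 11 inches) and 185 cm (6 feet 1 inch) both come out as 6.1. The sketch below is an alternative and not the project's method: it keeps feet and inches in separate columns so no two heights collide. The function name `cm_to_feet_inches` is hypothetical.

```r
# Convert centimetres to whole feet and remaining whole inches.
# cm_to_feet_inches is a hypothetical helper, not part of the original code.
cm_to_feet_inches <- function(cm) {
  total_in <- round(cm / 2.54)          # nearest whole inch
  data.frame(feet   = total_in %/% 12,  # whole feet
             inches = total_in %% 12)   # leftover inches, always 0 to 11
}

heights <- cm_to_feet_inches(c(193, 180, 163))
heights
```

Keeping two integer columns also makes it easy to rebuild a single numeric height for modelling, for example `feet * 12 + inches`.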
```{r,echo=FALSE, message=FALSE}
# Weight
url14 <- "https://www.potus.com/presidential-facts/presidential-weight/"
webpage14 <- read_html(url14)
Weight <- html_text(html_nodes(webpage14,'
.column-4 , .column-3 , .column-2'))
Weight <- gsub("lbs", "", Weight)
Weight <- Weight[-c(1:3)]
test2 <- unlist(lapply(split(Weight, ceiling(seq_along(Weight)/3)), paste,
collapse=", "))
Test2 <- test2[order(as.numeric(gsub("\\,.*", "", test2)))]
Test2 <- gsub(".*,", "", Test2)
#str(Test2)
Test2 <- unname(Test2)
Test2 <- append(Test2, "260", after = 21)
Test2 <- append(Test2, NA, after = 23)
Test2 <- Test2[-(46)]
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Test2 <- append(Test2, NA, after = i)
}
Presidents$Weight <- as.numeric(Test2)
```
```{r,echo=FALSE}
url15 <- "https://www.potus.com/presidential-facts/pardons-commutations/"
webpage15 <- read_html(url15)
Pardon <- html_text(html_nodes(webpage15,'
.column-4 , .column-2'))
Pardon <- gsub("lbs", "", Pardon)
Pardon <- Pardon[-c(1:2)]
test3 <- unlist(lapply(split(Pardon, ceiling(seq_along(Pardon)/2)), paste,
collapse=", "))
Test3 <- test3[order(as.numeric(gsub("\\,.*", "", test3)))]
Test3 <- gsub("^.*?\\,","", Test3)
Test3 <- gsub("\\*","", Test3)
Test3 <- gsub("11.*","11", Test3)
Test3 <- gsub(",","", Test3)
Test3 <- unname(Test3)
Test3 <- append(Test3, "1107", after = 21)
Test3 <- append(Test3, "NA", after = 23)
Test3 <- Test3[-(46)]
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Test3 <- append(Test3, NA, after = i)
}
Presidents$Pardon <- as.numeric(Test3)
```
```{r,echo=FALSE}
#Net Worth
url16 <- "https://www.potus.com/presidential-facts/net-worth/"
webpage16 <- read_html(url16)
NetWorth <- html_text(html_nodes(webpage16,'
.column-4 , .column-2'))
NetWorth <- NetWorth[-c(1:2)]
Test4 <- unlist(lapply(split(NetWorth, ceiling(seq_along(NetWorth)/2)),
paste, collapse=", "))
Test4 <- Test4[order(as.numeric(gsub("\\,.*", "", Test4)))]
Test4 <- gsub("^.*?\\,","", Test4)
Test4 <- gsub("4,500","2100", Test4)
Test4 <- gsub("<", "", Test4)
Test4 <- unname(Test4)
Test4 <- append(Test4, "28", after = 21)
Test4 <- append(Test4, "NA", after = 23)
Test4 <- Test4[-(46)]
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Test4 <- append(Test4, NA, after = i)
}
Presidents$NetWorth <- as.numeric(Test4)
```
## *Presidents Data Frame*
* Final Product of Web Scraping and Data Cleaning
```{r,echo=FALSE}
#Final Dataframe
Presidents <- Presidents[,c(1,2,3,5,4,6,8,7,14,15,16,17,18,11,12,13,21,19,20,22,9,10)]
dim(Presidents)
head(Presidents,3)
tail(Presidents,2)
```
*A small sample of the data cleaning I did on one variable. This was done
22 times to create the Presidents dataframe.*
```{r, warning = FALSE, message= FALSE}
# Type of Military Service
url8 <- "https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States_by_military_service"
webpage8 <- read_html(url8)
Military <- html_text(html_nodes(webpage8,'td:nth-child(3)'))
Military <- gsub("Continental army", "Continental", Military)
Military <- gsub("United States Army", "Army", Military)
Military <- gsub("United States Naval", "Navy", Military)
Military <- gsub("Tennessee.*", "Army", Military)
Military <- gsub(".*Navy.*", "Navy", Military)
Military <- gsub(".*Continental.*", "militia", Military)
Military <- gsub(".*militia.*", "State Militia", Military)
Military <- gsub(".*Militia.*", "State Militia", Military)
Military <- gsub(".*None.*", "None", Military)
Military <- gsub(".*Army.*", "Army", Military)
Military <- gsub(".*National.*", "National Guard", Military)
Military <- gsub(".*Connecticut.*", "None", Military)
#table(Military)
Military <- rev(Military)
for ( i in
c(1,4,6,8,11,21,24,33,36,39,42,43,44,46,48,51,53,57,60,62,64,32)) {
Military <- append(Military, NA, after = i)
}
Presidents$Military <-Military
```
### *General Sense of the Data*
My first goal was to get a general sense of the data and to see whether any
strong correlations stood out to investigate first.
```{r, echo = FALSE,out.width = "54%"}
library(purrr)
library(tidyr)
library(ggplot2)
Presidents %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
```
```{r, echo = FALSE,out.width = "54%"}
Pres <- Presidents[, c(2,5,6,7,8,9,10,11,12,13,17,18,19,20)]
CorP <- cor(Pres, use = "complete.obs")
CorTest <- cor.mtest(Pres, conf.level = .95)
corrplot.mixed(CorP, lower.col="black", upper = "ellipse", tl.col =
"black", number.cex=.7,
tl.pos = "lt", tl.cex=.7, p.mat = CorTest$p, sig.level = .05)
par(mar = c(10,5,5,10))
q306 <- pairsJDRS(Pres)
Pres2 <- Presidents[, c(2,5,8,9,11,17,13)]
CorP <- cor(Pres2, use = "complete.obs")
CorTest <- cor.mtest(Pres2, conf.level = .95)
corrplot.mixed(CorP, lower.col="black", upper = "ellipse", tl.col =
"black", number.cex=.7,
tl.pos = "lt", tl.cex=.7, p.mat = CorTest$p, sig.level = .05)
par(mar = c(10,5,5,10))
q306 <- pairsJDRS(Pres2, cex.lab = 1)
```
* There seems to be a correlation between ElectoralCollege and
SupremeCourt, so let's use correlation tests, t-tests, and bootstraps to
find out more.
### *T-Test and Bootstrap*
```{r, echo = FALSE,out.width = "54%"}
# One Sample T-Test
mean(Presidents$ElectoralCollege, na.rm = T)
round(t.test(Presidents$ElectoralCollege)$conf.int, 1)
(t.test1 <- t.test(Presidents$ElectoralCollege))
#str(t.test1)
#names(t.test1)
s <-
sample(Presidents$ElectoralCollege, 67, replace = TRUE)
M <- mean(s, na.rm = T)
MES <- mean(Presidents$ElectoralCollege, na.rm = T)
# Bootstrap of Electoral College Mean
x <- Presidents$ElectoralCollege
n_samp <- 20
n <- length(x)
means <- rep(NA, n_samp)
urad <- .5
plot(range(x, na.rm = T), c(0, (n_samp + 4)), type = "n", ylab = "",
main = "Bootstrapped Electoral College Percents", yaxt = "n", xlab =
"Percent")
points(x, rep(0,n), cex = urad, pch = 19, col = "black")
points(mean(x, na.rm = T), 0, pch = 8, col = "red", cex = 3*urad)
text(10, 1.5, "Actual Data")
for(i in 5:(n_samp + 4)){
s <- sample(x, n, replace = T)
means[i] <- mean(s, na.rm = T)
sVals <- as.numeric(names(table(s)))
sCounts <- as.vector(table(s))
points(sVals, rep(i,length(sVals)), cex = sqrt(sCounts)*urad, pch = 19,
col = "blue")
}
for(i in 5:(n_samp + 4)){
points(means[i], i, pch = 8, col = "red", cex = 2*urad)
}
n <- length(Presidents$ElectoralCollege)
n_samp <- 10000
bmeans <- rep(NA, n_samp)
for(i in 1:n_samp){
s <-
sample(Presidents$ElectoralCollege, n, replace=T)
bmeans[i] <-
mean(s, na.rm = T)
}
hist(bmeans, col = "blue", main = "Bootstrapped Electoral College Percents",
xlab = "Percent", breaks = 50)
qqPlot(bmeans, col = "red", pch=19, main = "qqPlot of Sample Means")
ci <- quantile(bmeans, c(.025, .975))
round(ci,3)
ABC <- round(t.test(Presidents$ElectoralCollege)$conf.int, 1)
hist(bmeans, col = "blue",
main = "Bootstrapped Sample Mean of Electoral College Percent",
xlab = "Percent", breaks = 50)
abline(v = ci, lwd = 3, col = "red")
abline(v = t.test(Presidents$ElectoralCollege)$conf.int, lwd = 3, col =
"green", lty = 2)
legend("topright", c("Original CI", "Boot CI"), lwd = 3, col =
c("green","red"), lty = c(2,1))
```
Our one-sample t-test gave a 95% confidence interval for the true mean
Electoral College share of 66.7% to 75%. We then bootstrapped the mean:
the scatterplot shows each resample's mean in bright red, and the
histograms show the spread of those means. Ultimately, though, our
bootstrap confidence interval differed only slightly from the t-based one.
From a practical standpoint, a candidate needs only a majority (just over
50%) of the Electoral College to win the presidency, yet the average win is
about 70%, which shows that most presidential elections have not been as
neck and neck as in recent years.
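The comparison between the t-based and bootstrap intervals can be sketched in a few lines. This is a minimal illustration under stated assumptions: the vector `x` holds hypothetical proportions rather than the real ElectoralCollege column, the seed and the number of resamples are arbitrary, and missing values are dropped before resampling so that every resample has the same size.

```r
set.seed(230)  # arbitrary seed so the sketch is reproducible

# Hypothetical electoral vote proportions, with one missing value
x <- c(0.52, 0.61, 0.70, 0.85, NA, 0.91, 0.58, 0.66)
x_obs <- x[!is.na(x)]  # drop NAs before resampling

# Percentile bootstrap: resample with replacement and record each mean
n_rep <- 2000
boot_means <- replicate(
  n_rep,
  mean(sample(x_obs, length(x_obs), replace = TRUE))
)
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# Classical t-based interval for comparison
t_ci <- t.test(x_obs)$conf.int

boot_ci
t_ci
```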
```{r,echo = FALSE, results = 'hide',out.width = "54%"}
bootplot <- function(data, nrep, plotit = F){
N <- nrow(data)
slopes <- as.list(rep(NA, nrep))
for(i in 1:nrep){
s <- sample(1:N, N , replace = T)
sVals <- as.numeric(names(table(s)))
sCounts <-
as.vector(table(s))
fakeData <-
data[s, ]
cor1 <- cor(fakeData[, 1], fakeData[,2])
lm1 <- lm(fakeData[, 2] ~ fakeData[, 1])
slopes[[i]] <- lm1$coefficients
if (plotit == T) {
plot(data, type = "n")
points(data[sVals, 1], data[sVals, 2], cex = sqrt(sCounts), pch = 19,
col = "red")
abline(lm1$coef, col = "blue", lwd = 3)
title(main = paste("cor = ",round(cor1, 2),", slope = ",
round(lm1$coef[2], 2), "Rep = ", i))
Sys.sleep(.5)
}
}
return(slopes)
}
```
### *Bootstrap of Electoral College and Supreme Court Correlation*
```{r, echo = FALSE,out.width = "54%"}
cor1 <- cor(Presidents$SupremeCourt, Presidents$ElectoralCollege, use =
"complete.obs")
cor1
cor.test(Presidents$SupremeCourt, Presidents$ElectoralCollege, use =
"complete.obs")
lm1 <-
lm(Presidents$ElectoralCollege ~ Presidents$SupremeCourt)
summary(lm1)
names(lm1)
plot( Presidents$SupremeCourt, Presidents$ElectoralCollege,
main = paste("r = ", round(cor1,2) , ", slope = ", round(lm1$coef[2],2)),
pch = 19, col = "red")
abline(lm1$coef, col = "blue", lwd = 3)
Max <- cbind(Presidents$SupremeCourt, Presidents$ElectoralCollege)
N <- nrow(Max)
n_samp <- 10000
corResults <- rep(NA, n_samp)
bResults <- rep(NA, n_samp)
for(i in 1:n_samp){
s <- sample(1:N, N , replace = T)
sVals <- as.numeric(names(table(s)))
sCounts <- as.vector(table(s))
fakeSC <- rep(Presidents$SupremeCourt[sVals], sCounts)
fakeEC <- rep(Presidents$ElectoralCollege[sVals], sCounts)
cor1 <- cor(fakeSC, fakeEC, use = "complete.obs")
lm1 <- lm(fakeEC ~ fakeSC)
corResults[i] <- cor1
bResults[i] <- lm1$coef[2]
}
ci_r <- quantile(corResults, c(.025, .975), na.rm = T)
ci_slope <- quantile(bResults, c(.025, .975), na.rm = T)
hist(corResults, col = "blue", main = "Bootstrapped Correlations", xlab =
"Sample Correlation", breaks = 50)
abline(v = ci_r, lwd = 3, col = "red")
abline(v = cor.test(Presidents$SupremeCourt, Presidents$ElectoralCollege)
$conf.int, lwd = 3, col = "green", lty = 2)
legend(-0.3, 450, c("Theoretical CI","Boot CI"), lwd = 3, col =
c("green","red"), lty = c(2,1))
qqPlot(corResults, main = "qqPlot of corResults")
#Bootstrap Line Graphs
TTest <- t.test(Presidents$ElectoralCollege, Presidents$SupremeCourt)
(TestCor <- cor(Presidents$ElectoralCollege, Presidents$SupremeCourt, use =
"complete.obs"))
qqPlot(Presidents$ElectoralCollege, col = 2, pch = 19)
cor1 <- cor(Presidents$SupremeCourt, Presidents$ElectoralCollege, use =
"complete.obs")
bootlines <- bootplot(Max, 100, plotit = F)
lm1 <-
lm(Presidents$ElectoralCollege ~ Presidents$SupremeCourt)
plot(Max, main = paste("Electoral College and Supreme Court Data, r = ",
round(cor1,2) , ", slope = ", round(lm1$coef[2],2)),
pch = 19, col = "red", xlab = " Number of Supreme Court Confirmations",
ylab = " Electoral College Percent")
abline(lm(Presidents$ElectoralCollege ~ Presidents$SupremeCourt), col =
"blue", lwd = 3)
for (i in 1:50){
abline(bootlines[[i]], col = "green", lty = 2)
}
```
We explored the relationship between Supreme Court confirmations and
Electoral College percentages, and we found a statistically significant
positive relationship between the two. Our bootstrap puts the correlation
at around 0.44, so more Supreme Court confirmations are associated with
stronger election performance. One plausible explanation is that judicial
appointments are an important consideration for voters when they evaluate
a candidate.
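A bootstrap interval for a correlation resamples whole rows, so each Supreme Court value stays paired with its own Electoral College value. The sketch below shows that idea on hypothetical data; the data frame `toy`, its column names, the seed, and the number of resamples are all assumptions made for illustration.

```r
set.seed(230)  # arbitrary seed so the sketch is reproducible

# Hypothetical paired data standing in for SupremeCourt and ElectoralCollege
toy <- data.frame(
  confirmations = c(1, 2, 0, 4, 3, 5, 2, 6, 1, 4),
  ec_share      = c(0.55, 0.60, 0.52, 0.75, 0.58, 0.90, 0.65, 0.80, 0.50, 0.70)
)

n_rep <- 2000
boot_r <- replicate(n_rep, {
  idx <- sample(nrow(toy), replace = TRUE)  # resample row indices
  cor(toy$confirmations[idx], toy$ec_share[idx])
})

# Percentile interval for the correlation
ci_r <- quantile(boot_r, c(0.025, 0.975), na.rm = TRUE)
ci_r
```

If the interval excludes zero, that agrees with a significant `cor.test()` result, although with a sample this small the two intervals can differ noticeably.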
### *Permutation*
```{r, echo = FALSE, out.width = "54%"}
fakeEC <- sample(fakeEC)
plot(Presidents$SupremeCourt, fakeEC, pch=19, col="red", main="")
mtext("Fake Relationship between Supreme Court and Electoral College",
cex=1.2, line=1)
mtext(paste("Correlation =", round(cor(Presidents$SupremeCourt, fakeEC, use
= "complete.obs"),2)), line=0, cex=1)
n_samp <- 10000
corResults <- rep(NA, n_samp)
for(i in 1:n_samp){
corResults[i] <- cor(Presidents$SupremeCourt,
sample(Presidents$ElectoralCollege), use = "complete.obs")
}
(truecor <- mean(abs(corResults) >=
abs(cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use =
"complete.obs"))))
hist(corResults, col = "yellow", main = "", xlab = "Correlations", breaks =
50)
mtext("Permuted Sample Correlations", cex = 1.2, line = 1)
mtext(paste("Permuted P-value =",round(truecor,4)), cex = 1, line = 0)
abline(v = cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use =
"complete.obs"), col="blue", lwd=3)
text(cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use =
"complete.obs")-.02, 200 ,paste("Actual Correlation =",
round(cor(Presidents$SupremeCourt,Presidents$ElectoralCollege, use =
"complete.obs"),2)),srt = 90)
```
We completed a permutation test to check whether the correlation we found
in our bootstrap could have arisen by chance. We shuffled the Electoral
College values, which breaks any real pairing with Supreme Court
confirmations, and recomputed the correlation many times. A correlation as
large as .44 was highly unlikely under this shuffling, so it is
statistically significant, which supports our previous evaluation. There is
a statistically significant correlation between Supreme Court
confirmations and election results.
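The permutation logic can be written compactly. This is a sketch on hypothetical vectors `x` and `y`, with an arbitrary seed and number of shuffles: shuffle one variable, recompute the correlation, and report the share of shuffles at least as extreme as the observed value.

```r
set.seed(230)  # arbitrary seed so the sketch is reproducible

# Hypothetical paired observations
x <- c(1, 2, 0, 4, 3, 5, 2, 6, 1, 4)
y <- c(0.55, 0.60, 0.52, 0.75, 0.58, 0.90, 0.65, 0.80, 0.50, 0.70)

obs_r <- cor(x, y)  # observed correlation

# Shuffling y destroys any real association with x
n_perm <- 5000
perm_r <- replicate(n_perm, cor(x, sample(y)))

# Two-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(abs(perm_r) >= abs(obs_r))
p_value
```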
### *Multiple Regression*
We were still interested in exploring the Electoral College variable
further, so we attempted to create a predictive model. Rows with NAs are
dropped when the model is fit, and we used the stepAIC function to run a
stepwise regression starting from the full model. We also looked at some
individual variables to test their strength.
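For readers unfamiliar with `stepAIC`, here is a small sketch of the same workflow on R's built-in `mtcars` data, used purely as a stand-in for the Presidents data frame. The choice of predictors is arbitrary; the call mirrors the `direction = "both"` setting used in the chunk below.

```r
library(MASS)

# Full model on a stand-in data set (mtcars), with arbitrary predictors
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Stepwise search that may drop terms and add them back, scored by AIC
reduced <- stepAIC(full, direction = "both", trace = FALSE)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

The selected model never has a higher AIC than the starting model, because a step is taken only when it lowers the AIC.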
```{r, echo = FALSE,out.width = "54%", include = FALSE}
#names(Presidents)
#str(Presidents)
library(MASS)
full.model <- lm(ElectoralCollege ~ AgeatElection + Party +ElectionYear
+Inaug +ExecOrder + SupremeCourt + TotalJ +Vetoes +Overridden + Educ +
PreviousExp +Military + Pardon + Height +Weight + NetWorth +ReWon , data =
Presidents)
# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both",
trace = T)
print("Final Output of stepAIC backwards regressions")
#summary(step.model)
```
```{r, echo = FALSE,out.width = "54%", return = 'hide'}
myResPlots2(step.model)
(Mod1 <- lm(ElectoralCollege ~ SupremeCourt, data = Presidents))
#summary(Mod1)
(Mod1 <- lm(ElectoralCollege ~ ExecOrder, data = Presidents))
#summary(Mod1)
```
Here our various regression models were trying to predict Electoral College
success. By using the stepAIC stepwise regression function, we were left
with a regression equation with about 4 fewer terms than we started with,
and a lower AIC. Our final model suggested that previous military service,
winning reelection, number of vetoes, and independent status may have some
statistically significant effects on election winning. (However, we had an
r-squared of .87, which leaves room for improvement, and almost all of our
variables were not statistically significant, whether because of the size
of the data set, the content of the variables, or the fact that
presidential electoral results aren't entirely predictable.) We would
definitely like to explore some more complex models concerning this. As an
alternative, we also took a look at some one-off variables to see which
single variable could best describe Electoral College Percent by itself,
and the result was Supreme Court confirmations (with an r-squared of
.1965), followed by Executive Orders (r-squared = .11).
Our final model shows residuals that are approximately normally distributed
and do not show signs of heteroskedasticity. Our data set seems to have a
good distribution, but its naturally small size (only 45 presidencies) is
probably responsible for the low r-squared values of the single-variable
models.
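The single-variable comparison described above amounts to fitting one simple regression per candidate predictor and ranking the r-squared values. A minimal sketch follows, again using the built-in `mtcars` data as a stand-in; the response and predictor names are assumptions for illustration only.

```r
# Candidate predictors in the stand-in data set (arbitrary choice)
predictors <- c("wt", "hp", "qsec")

# Fit one simple regression per predictor and collect its r-squared
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Rank predictors from most to least variance explained
sort(r2, decreasing = TRUE)
```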
## *Fun Facts*
```{r, echo = FALSE,out.width = "54%"}
hist(AgeatElection, col = "yellow",
main = "Histogram of Presidential Ages at Election")
abline(v = mean(AgeatElection), col = "red", lwd = 3, las = 2)
```
## *Out of presidents running for reelection, 70% have won, while 30% have lost.*
### Tallest Presidents: Abraham Lincoln, Lyndon B. Johnson, Donald Trump, George Washington, and Thomas Jefferson
```{r, echo = FALSE}
abcd <- head(Presidents[order(-Presidents$Height),],8)
```
Shortest Presidents: James Madison, Benjamin Harrison, Martin Van Buren,
William McKinley, John Quincy Adams
```{r, echo = FALSE,out.width = "54%"}
abcde <- tail(Presidents[order(-Presidents$Height),],7)
boxplot(AgeatElection ~ Educ, col = "red", main = "Age at Election by Education",
xlab = "Type of Education", ylab = "Age", las = 2, cex.axis = .6)
```
*There seem to be no large discrepancies in age by education type, though
this should be examined more closely!*
## Proportional Pie Chart of Education of Presidents
```{r, echo = FALSE,out.width = "54%"}
library(plotrix)
pie3D(table(Presidents$Educ), theta=pi/4, explode=0.1,
labels=names(table(Presidents$Educ)), labelcex = 1.04)
```
## Proportional Pie Chart of Previous Experience of Presidents
```{r, echo = FALSE,out.width = "54%"}
library(plotrix)
pie3D(table(Presidents$PreviousExp), theta=pi/4, explode=0.1,
labels=names(table(Presidents$PreviousExp)), labelcex = 1.04)
```
## Proportional Pie Chart of Party of Presidents
```{r, echo = FALSE,out.width = "54%"}
library(plotrix)
pie3D(table(Presidents$Party), theta=pi/4, explode=0.1,
labels=names(table(Presidents$Party)), labelcex = 1.04)
```
# *Conclusion*
This has been a very brief look into some of the possibilities opened up by
the Presidents data set. To summarize, I compiled various data points on
all American presidents by web scraping from several sources to create a
new data frame. We looked at the correlations and distributions of the
variables in our data frame and chose several relationships to examine more
closely. We looked at the relationship between Electoral College Percentage
and Supreme Court Confirmations and found a statistically significant
positive relationship between them. Then we tried various regression models
to predict election success. By using the stepAIC stepwise regression
function, we were left with a regression equation that suggested that
previous military service, winning reelection, number of vetoes, and
independent status may have some effect on election winning. We would
definitely like to explore some more complex models concerning this. We
also took a look at some one-off variables to see which single variable
could best describe Electoral College Percent, and the result was Supreme
Court confirmations, followed by Executive Orders. Finally, we looked at
some fun facts about the presidents' ages, heights, and experiences.
*There is much more to discover and learn. This data set will continue to
be updated and to grow.*
![](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/The_Seal_of_the_President_of_the_United_States.jpg/220px-The_Seal_of_the_President_of_the_United_States.jpg)