Jak Marshall

The Presidential Lexicon – Linguistic Analysis of State of the Union Addresses 1790-2008

Introduction

The goal of this analysis is to study various aspects of speeches given by US Presidents, in particular the State of the Union Addresses. We will look at summary statistics such as the number of sentences in each speech and the average number of words per sentence, and at how these attributes have varied among the presidents and over time. We will also compare the distribution of individual words between presidents.

Methods

From a raw text file containing all of the State of the Union Addresses I obtained various summary statistics. For each speech I counted both the number of individual sentences and the average number of words per sentence. For each speech I also counted the frequency of stems. A stem is the common root of several related words: for example, for the words 'runs' and 'running' the appropriate stem is 'run'. By counting stems we avoid classing very similar words as entirely different words, which would be both inefficient and unnecessary.

I wrote code in the programming language R which first separated the text file into 'blocks' of text, where each block contained a single speech made by one president at one time. This was possible because each speech was separated by three asterisks ('***') in the raw text file, so the code was able to recognize when one speech ended and another began.

To get the number of sentences in each speech (each 'block'), the code then split the text on periods or ellipses using a similar process and counted the resulting smaller pieces of text, giving the number of sentences in each speech. To get the number of words in a sentence, the code took each of the sentences in the speech (broken out earlier when counting sentences) and split them on spaces, commas and various other punctuation marks to return a collection of words. The words in each sentence were counted, and the mean number of words per sentence was obtained by adding these counts together and dividing by the number of sentences in the speech calculated earlier.

By attributing each speech to the president who made it and the year in which it was made, I was able to generate a colour-coded plot of the number of sentences in the State of the Union Addresses against time (colours identify the presidents) and a similar plot of the mean number of words per sentence. I did not produce the code that generates the plot itself; I only supplied the data for the plot to use.

For the stem counting I first converted all of the words in the speeches into lower case using an R command, so that identical words were not classed differently because of capitalization. I then used a purpose-built function called wordStem (not written by myself) to convert the entirety of each individual speech into stem form; this essentially returned each speech with every word replaced by its stem. With some further code I created a list of all the unique stems that appear across all of the speeches. So for each speech I now have a list of stems, including repeats, and I also have a dictionary of stems that appear at least once in at least one speech.
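To make the procedure concrete, here is a toy sketch of these steps in R, applied to a single made-up speech string rather than the real text file. The variable names (speech, stem.dict and so on) are illustrative only; the full code, which works on the raw file, is given in the appendix.

# A toy sketch of the sentence counting, word counting and stemming steps,
# applied to one made-up speech string (see the appendix for the full code).
library(Rstem)                                                  # provides wordStem()
speech <- "The state of the union is strong. We will act, and we will succeed."
sentences  <- strsplit(speech, split = "\\.+")[[1]]             # split on periods/ellipses
words      <- strsplit(speech, split = "[[:punct:] +]+")[[1]]   # split on spaces and punctuation
nsentences <- length(sentences)                                 # number of sentences in the speech
meanwords  <- length(words) / nsentences                        # mean words per sentence
stems      <- wordStem(tolower(words))                          # lower-case, then replace each word by its stem
stem.dict  <- unique(stems)                                     # unique stems in this speech

Applying the same steps to each block of the real file gives the nsentences and meanwords vectors used in the first plot, and pooling the per-speech stem lists gives the stem dictionary.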
I was then able to use R to look at each individual speech and obtain the frequency of all the stems in my stem dictionary that appear in that speech. Much of the time the frequency of a given stem is zero in most speeches, since a stem may appear only once in a single speech. So for each speech I now had a distribution of the stems in that speech.

Having obtained a list of stem frequencies for each speech (the stem distribution for each speech), I was able to feed this data into a piece of code (again, not written by myself) which takes all of these distributions and performs a multi-dimensional scaling operation on them to compute the distance between each pair of distributions. This multi-dimensional approach is required because each of the distributions is essentially a vector of counts for every stem that appeared at least once in one of the speeches. These vectors were of length 13888, so the operation has to handle 218 such vectors (there were 218 speeches) in R^13888. Once these distances had been computed, the same piece of code produced a plot representing the distances in stem distributions between the presidents, grouping the speeches by orator.

I also wrote some code to count the overall frequency of the stems across all of the speeches (rather than within each individual speech) and sorted the list in decreasing order to see the top 25 words used by US Presidents in the State of the Union Addresses.

Results

The colour-coded plots of the number of sentences and the mean number of words per sentence for all of the speeches, plotted against time, are shown below.

The plot shows that the number of sentences in State of the Union Addresses tended to increase over time from George Washington's first address, dropped sharply with the arrival of Woodrow Wilson, and then generally either rose slightly above or remained steady at this level until the present day. The exceptions are the unusually long speeches given by Harry S. Truman and Jimmy Carter.

Another noticeable trend is the decrease in mean sentence length over time from George Washington to George Bush. The pattern shows that as we approach the modern presidents, sentences become shorter and shorter. Although the mean sentence length is quite varied and erratic to look at, the plot clearly shows a decline over time; thinking in terms of a 'best fit line' makes this apparent.

Now I turn to the results of the multidimensional scaling operation, which compared the stem distributions of the various presidents' speeches. A colour code is again used, here to identify the time period of the presidents displayed.

Looking at the plot, it would seem that presidents who are closer together in history also tend to be closer together in stem distribution. George Washington's stem distribution is most similar to those of other presidents from his era, and this holds generally for every president. However, the older presidents show more conformity: their stem distributions are more tightly clustered than those of the modern presidents, whose stem distributions are not so clustered together. There is almost a parabolic shape to the plot, and there is a very clear divide between post-1950s presidents and pre-1950s presidents, with each group occupying one side of the plot a considerable distance away from the other.
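The overall word ranking reported next was obtained by summing the per-speech stem counts; a minimal sketch is shown here, assuming the stem.count matrix (one row per stem, one column per speech) and the all.stems dictionary built in the appendix code. The head() call is used only to display the first 25 entries.

# A minimal sketch of the overall stem ranking, assuming stem.count and
# all.stems from the appendix code.
stem.sums <- apply(stem.count, MARGIN = 1, sum)    # total uses of each stem across all 218 speeches
names(stem.sums) <- all.stems                      # label each count with its stem
head(sort(stem.sums, decreasing = TRUE), n = 25)   # the 25 most frequent stems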
The top 25 words used across all the speeches are as follows: 1 – the, 2 – of, 3 – to, 4 – and, 5 – in, 6 – a, 7 – that, 8 – it, 9 – be, 10 – for, 11 – our, 12 – is, 13 – by, 14 – have, 15 – which, 16 – as, 17 – with, 18 – this, 19 – we, 20 – will, 21 – I, 22 – on, 23 – has, 24 – state, 25 – not.

Conclusions

From my analysis of the State of the Union Addresses I have learned how various aspects of the presidential speeches have changed. Sentences have become shorter, and whilst the addresses seemed to grow longer and longer at one stage, the presidents after Wilson appear to have settled on a number of sentences that is not as low as George Washington's but not as high as Truman's post-war speech.

I have also found that the most commonly used words in these speeches are simple syntactical words such as 'the' and 'is' that are among the most ubiquitous words in most written documents. The only notable non-syntactical word is 'state', which is unsurprising given the nature of the documents being studied here. Specific words such as 'Afghanistan' and 'terrorist' are not used by all of the presidents, and the presidents who do use them do not tend to repeat them to the extent that everyday words like 'and' and 'as' are repeated. In further research I would like to see which stems are most popular other than these very common syntactical words, to find which 'meaningful' words are most commonly used. The criterion for deciding which words should be omitted is also a problem to be considered.

Harry S. Truman's speech came shortly after the end of the Second World War, whilst Jimmy Carter presided over a severe oil crisis. This may be why these addresses are so long, but that is only a conjecture of mine. Sentences have definitely become much shorter in modern speeches, as the first plot shows, and the stem distributions of the modern speeches are very different from the older ones. The difference in stem distributions may reflect a general trend seen in all documents from these times, as the change in language, choice of words and style over time is a well-known phenomenon. However, the shortening of sentences may have various explanations. Perhaps the introduction of mass media has persuaded presidents to adopt more succinct, sharp and perhaps even more theatrical ways of delivering their addresses, or this may simply be another phenomenon caused by the passage of time. This would be worth considering in future studies.

Appendix – R code

#State of the Union Analysis
#Graded Assignment 4
#Jak Marshall

# - Here I read in the text file from bspace, which contains a list of all the presidents and the dates of the speeches followed by all of the State of the Union Addresses.
options(max.print=10000000)
alltext<-readLines(con="/class/u/s133/s133ap/Desktop/stateoftheunion17902008.txt")

# - Using indexing I strip out the lines which list the presidents and the dates of the speeches
prestext<-alltext[11:228]

# - I use the same regular expression with two different back references to create first a vector of president names and then a vector of raw dates
Pres<-gsub(pattern="^ ([^,]+), [^,]+, ([^,]+, [[:digit:]]+)", "\\1", prestext)
RawDates<-gsub(pattern="^ ([^,]+), [^,]+, ([^,]+, [[:digit:]]+)", "\\2", prestext)

# - Using strptime and a date format string I convert the dates from the given format to the POSIXt format.
Dates<-strptime(RawDates, format="%B %d, %Y")

#----------------------------
# The following code creates the vectors meanwords and nsentences required to produce the first plot
speech.sentences<-speech.words<-vector("list", length(Pres))
meanwords<-rep(NA, times=length(Pres))
nsentences<-rep(NA, times=length(Pres))
speechtext<-alltext[233:168851]
a<-c(grep("\\*\\*\\*", speechtext), length(speechtext))-2
b<-c(5, grep("\\*\\*\\*", speechtext)+6)
for(i in 1:218){
  x<-speechtext[b[i]:a[i]]
  x<-paste(x, sep="", collapse=" ")
  speech.sentences[i]<-strsplit(x, split="\\.+")
  speech.words[i]<-strsplit(x, split="[[:punct:] +]+")
  nsentences[i]<-length(speech.sentences[[i]])
  meanwords[i]<-length(speech.words[[i]])/length(speech.sentences[[i]])
}

#----------------------
# This code produces the first plot
par(mfrow = c(2, 1), mar = c(0, 4, 1, 1), oma = c(12, 0, 2, 0), xpd = NA)
plot(Dates, nsentences, type = "l", col = "grey", xaxt = "n", ylab = "Number of Sentences")
title(main = "Summary Statistics for State of the Union Addresses", line = 1)
points(Dates, nsentences, pch = 16, col = match(Pres, unique(Pres)))
plot(Dates, meanwords, type = "l", col = "grey", ylab = "Avg Words per Sentence")
points(Dates, meanwords, pch = 16, col = match(Pres, unique(Pres)))
text(Dates[!duplicated(Pres)], 5, unique(Pres), srt = -90, col = 1:length(Pres), adj = 0)
dev.print(pdf, file = "SoUplot.pdf", height = 8, width = 10)

# - I load the library I need and create the list of stems in speeches (speech.stems) and the stem dictionary (all.stems)
library(Rstem, lib.loc="~/Rlibs")
speech.stems<-lapply(lapply(speech.words, tolower), wordStem)
all.stems<-unique(unlist(speech.stems))

# - I create the matrix of stem frequencies for all the speeches
stem.count<-matrix(nrow=length(all.stems), ncol=218)
for(i in 1:218){
  stem.count[,i]<-tabulate(factor(speech.stems[[i]], levels=all.stems), length(all.stems))
}

# - This code uses stem.count to give the second plot
source("/class/u/s133/s133ap/Rlibs/SJD.R")
presfreq <- function(president, stem.count, Pres){
  submat <- stem.count[, Pres == president, drop = FALSE]
  wordcounts <- apply(submat, 1, sum)
  return(wordcounts/sum(wordcounts))
}
word.freq <- sapply(unique(Pres), presfreq, stem.count, Pres)
SJ.mat <- computeSJDistance(terms = all.stems, df = apply(word.freq > 0, 1, sum), tf = word.freq)
mds <- cmdscale(SJ.mat)
require(fields)
lim <- range(mds[,1] + c(-0.15, 0.15) * diff(range(mds[,1])))  # Make room for names
par(oma = c(3, 0, 0, 0))
plot(mds, type = "n", xlim = lim, bty = "l", xlab = "", ylab = "")
text(mds, label = unique(Pres), col = tim.colors(length(unique(Pres))))
title(main = "Distances Between Presidential Word Frequency Distributions")
image.plot(zlim = c(1790, 2008), col = tim.colors(40), legend.only = TRUE, horizontal = TRUE, legend.mar = 0, legend.shrink = 0.7)
dev.print(pdf, file = "mds.pdf", height = 7, width = 10)

# - This code gives the total number of uses of each stem across all of the speeches
stem.sums<-apply(stem.count, MARGIN=1, sum)
# - This code gives each element its associated stem as the names of the vector
names(stem.sums)<-all.stems
# - This puts them in decreasing order (the first 25 entries give the top 25 words reported above).
sort(stem.sums, decreasing=TRUE)
#----------------