Homework 7 – Due November 7, 12 am

This homework is worth 100 points in total. Of these, 20 points are reserved for clarity of presentation,
punctuation, and commenting of the code.
1. The Florida 2000 Elections. The 2000 US Presidential election turned on Florida, which had a number
of balloting issues. Among them was the fact that the ballot in Palm Beach County had the potential
to lead voters favoring Democratic Party candidate Al Gore to mistakenly cast their votes for Reform
Party candidate Pat Buchanan. (The other candidates in the Presidential election were Republican
George W. Bush and Green Party candidate Ralph Nader.) A Senate race was also decided in Florida
at this time: the candidates were the Democratic Party's Nelson, Republican McCollum, and a third
party candidate, Logan.
In this problem, we will analyze and compare the county vote totals for all candidates in the two elections. The dataset at http://maitra.public.iastate.edu/stat579/datasets/FL2000vote.dat
provides data on the returns from each of 66 counties in Florida for the major candidates for the two
statewide races in 2000 (elections for the US Presidency and Senate). County-wise vote totals for four
Presidential candidates (Buchanan-Reform, Bush-Republican, Gore-Democratic, Nader-Green) and
three Senate candidates (Logan-Other, McCollum-Republican and Nelson-Democratic) are reported.
(a) For each race, calculate the countywide totals of votes cast, but don’t report the numbers. Instead,
plot the two sets of totals on the same plot, but with different characters and against the 66
counties (in alphabetical order). Are there any counties for which there were more votes cast for
the three candidates in the Senatorial contest than for the four major Presidential candidates?
[4+4+2 points]
(b) Draw piecharts for the proportion of votes secured by the four Presidential candidates over the
entire state and in Palm Beach County. [10 points]
(c) Report the proportion of votes in favor of Buchanan (along with the standard error) in Palm
Beach County, home of the infamous butterfly ballot, and over all the other 65 counties. Are
there any summary conclusions that can be drawn? Also state your assumptions. [(2+2)×2+1+1 points]
(d) Display the vote totals of all four candidates on the same plot. In order to do this, plot the
countywide vote totals for Bush against those for Gore, but use plotting characters with character size
(cex) proportional to countywide vote totals for Buchanan and color to signify the percentage
of votes for Nader. Comment, and use text() with the locator() function to identify observations
(counties) that stand out. [10 points]
(e) Do a similar exercise as above but instead, represent the vote proportions of the two major party
Presidential (Bush and Gore) and the senatorial (McCollum and Nelson) candidates. 1 Comment.
[10 points]
(f) Now, we will consider the proportion of votes polled by the two Democratic candidates and
how they relate to each other when controlled for the proportion of votes polled by Buchanan.
Provide a conditioning plot of the vote proportion of Gore against Nelson conditioned by the vote
proportion of Buchanan. [10 points]
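The calculations behind parts (c) and (d) can be sketched on made-up numbers. Everything in the block below is an assumption made for illustration: the data frame votes, its column names, and the three toy counties are hypothetical and are not the layout or contents of FL2000vote.dat. The standard error uses a binomial model (independent votes with a common probability), which is itself an assumption that should be stated in an answer.

```r
# Hypothetical toy data: three made-up counties, NOT the real Florida returns.
votes <- data.frame(
  county   = c("A", "B", "C"),
  buchanan = c(30, 120, 50),
  bush     = c(4000, 9000, 6000),
  gore     = c(5000, 8000, 3000),
  nader    = c(100, 400, 150)
)

# Total votes for the four candidates in each county.
total <- unname(rowSums(votes[, c("buchanan", "bush", "gore", "nader")]))

# Proportion and its standard error under a binomial model:
# se = sqrt(p * (1 - p) / n).
prop_se <- function(x, n) {
  p <- x / n
  c(proportion = p, se = sqrt(p * (1 - p) / n))
}
prop_se(votes$buchanan[2], total[2])               # one county of interest
prop_se(sum(votes$buchanan[-2]), sum(total[-2]))   # all other counties pooled

# Four variables on one scatterplot: position (Gore, Bush), symbol size
# (Buchanan votes) and color (Nader percentage, binned into 5 classes).
nader_pct <- 100 * votes$nader / total
cols <- heat.colors(5)[cut(nader_pct, breaks = 5, labels = FALSE)]
plot(votes$gore, votes$bush,
     pch = 19,
     cex = 4 * votes$buchanan / max(votes$buchanan),
     col = cols,
     xlab = "Gore votes", ylab = "Bush votes")

# Labels placed at known coordinates; in an interactive session,
# locator() can supply the coordinates for text() instead.
text(votes$gore, votes$bush, labels = votes$county, pos = 3)
```

The scaling factor 4 for cex and the five color classes are arbitrary choices; other scalings may read better on the real data.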
1 This is a way to represent four-dimensional information on a two-dimensional plot without any additional reduction.
A fifth dimension that may be used is the plotting character.
Stat 579, Fall 2011 – Maitra
2. In this problem we graphically demonstrate the central limit theorem. Generate 100 random samples
of size n = 5 from a gamma distribution with scale parameter equal to 1 and shape parameter equal
to α = 1, and compute the sample means. Draw a box plot of these simulated sample means. Repeat
this for sample sizes of n = 15 and 30, obtaining three box plots in all. Now repeat this procedure for
samples generated from gamma distributions with scale parameter equal to 1 and shape parameter
equal to α = 2 and 5, respectively. Display the 9 box plots on one page, with those for the three sample sizes
appearing on the same row and those for different α values on separate rows, labelling each graph with
an appropriate title such as “shape=2, n=5”, etc. [10 points]
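One possible layout for the simulation is sketched below. The helper sim_means() and the seed are hypothetical choices made for this illustration and are not prescribed by the problem.

```r
set.seed(579)            # arbitrary seed, only for reproducibility

shapes <- c(1, 2, 5)     # one row of plots per shape parameter
sizes  <- c(5, 15, 30)   # one column of plots per sample size
n_rep  <- 100            # number of samples per box plot

# Means of `reps` samples of size n from a gamma distribution
# with the given shape and scale 1.
sim_means <- function(n, shape, reps = n_rep) {
  replicate(reps, mean(rgamma(n, shape = shape, scale = 1)))
}

# mfrow fills the 3 x 3 grid row by row, so the outer loop (shape)
# determines the row and the inner loop (n) the column.
op <- par(mfrow = c(3, 3))
for (a in shapes) {
  for (n in sizes) {
    boxplot(sim_means(n, a), main = paste0("shape=", a, ", n=", n))
  }
}
par(op)                  # restore the previous graphical settings
```

As n grows, the box plots within a row should look more symmetric and less spread out, which is the behavior the central limit theorem predicts.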
3. The R function hist() returns an object that is a list with components. One component is the vector
density containing estimated density values. With equi-spaced breaks (i.e., equi-spaced bin intervals),
these are the relative frequencies divided by the common bin width, so they equal the relative
frequencies only when the bins have width 1. If plot=TRUE, which is the default, hist() also produces
a histogram on an R graphics device that plots the counts component, a vector containing the frequency counts,
on the y-axis. If freq=FALSE is specified, the values of the vector density are plotted on the y-axis instead.
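The components of the returned object can be inspected without drawing anything. The short sketch below uses an arbitrary uniform sample and hand-picked breaks, both chosen only for illustration. It checks that density is the relative frequency divided by the bin width, so the two coincide only for bins of width 1.

```r
set.seed(1)                      # arbitrary seed
x <- runif(50)                   # illustrative data on (0, 1)

# plot = FALSE returns the list without opening a graphics device.
h <- hist(x, breaks = seq(0, 1, by = 0.25), plot = FALSE)

names(h)                         # includes "breaks", "counts", "density"
h$counts                         # frequency counts per bin

rel_freq <- h$counts / sum(h$counts)

# density is relative frequency per unit of bin width ...
all.equal(h$density, rel_freq / diff(h$breaks))

# ... so the bar areas sum to one.
all.equal(sum(h$density * diff(h$breaks)), 1)
```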
In this problem you will plot a histogram of a random sample of size 100 generated from an exponential
distribution with mean 5, and overlay a line plot of the density curve of the same distribution on the
histogram. Use the hist() and the curve() functions to accomplish this. You must avoid any parts
of the plot getting chopped off due to one of the axes being of insufficient length. Hint: You may
consider making your graph big enough to hold both plots. Scale the y-axis in hist() using the ylim=
parameter. Obtain a value for this by first calculating the maximum heights of both the bars of the
histogram and density curve. [10 points]
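The hint can be followed along these lines. This is a sketch rather than the required solution: the seed, title, and line width are arbitrary choices. It relies on the fact that the exponential density is largest at 0, where it equals the rate.

```r
set.seed(579)                    # arbitrary seed
rate <- 1 / 5                    # an exponential with mean 5 has rate 1/5
x <- rexp(100, rate = rate)

# First pass: compute the histogram without plotting to get bar heights.
h <- hist(x, plot = FALSE)

# The exponential density is decreasing, so its maximum is dexp(0) = rate.
y_max <- max(h$density, dexp(0, rate = rate))

# Second pass: plot on the density scale with room for both bars and curve.
hist(x, freq = FALSE, ylim = c(0, y_max),
     main = "Exponential sample (mean 5) with true density", xlab = "x")

# Overlay the density curve across the range covered by the bins.
curve(dexp(x, rate = rate), from = 0, to = max(h$breaks),
      add = TRUE, lwd = 2)
```

Setting ylim from the larger of the two maxima keeps both the tallest bar and the peak of the curve inside the plotting region.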