ch1-links - York University

Course outline
histogram.txt / barplot.txt / piechart.txt / stem-and-leaf.txt / scatter-and-bar.txt / ch1-image
Kerns chapter on types of data and basic R commands
Ch1 exercises: 1.23, 1.29, 1.32, 1.36, 1.38
1.23 Medical students. Students who have finished medical school are assigned to residencies in
hospitals to receive further training in a medical specialty. Here is part of a hypothetical data
base of students seeking residency positions. USMLE is the student’s score on Step 1 of the
national medical licensing examination.
(a) What individuals does this data set describe?
(b) In addition to the student’s name, how many variables does the data set contain? Which of
these variables are categorical and which are quantitative?
Answer
(a) The individuals are the students: Laurie Abrams, Gordon Brown, Maria Cabrera, and Miranda Ismael.
(b) There are five variables (excluding Name). Medical school, Sex, and Specialty sought are
categorical; Age and USMLE are quantitative.
1.29 Canadian students rate their universities. The National Survey of Student Engagement
asked students at many universities, “How would you evaluate your entire educational
experience at this university?” Here are the percents of senior-year students at Canada’s 10
largest primarily English-speaking universities who responded “Excellent”:
(a) The list is arranged in order of undergraduate enrollment. Make a bar graph with the bars in
order of student rating.
(b) Explain carefully why it is not correct to make a pie chart of these data.
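Answer
(a) Here is the bar plot, with the bars in descending order of student rating.
(b) Each percentage is computed separately for one university, so the ten values are not parts of
a single whole and do not add up to 100%. A pie chart is appropriate only for parts of one whole.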
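The ordering step in part (a) can be sketched in R as follows. The survey percentages are not reproduced on this page, so the three values and the university labels below are hypothetical placeholders, not the data from the exercise; the name `by_rating` is also just illustrative.

```r
# HYPOTHETICAL placeholder values, not the survey results from Exercise 1.29
pct_excellent <- c("University A" = 30, "University B" = 45, "University C" = 38)

# Sort in descending order of rating before plotting
by_rating <- sort(pct_excellent, decreasing = TRUE)

# Bar graph with the highest-rated university first
barplot(by_rating, las = 2, ylab = "Percent responding Excellent")

# Each value is a separate percentage, so the total need not be 100
total <- sum(pct_excellent)
total
```

Because the total is not 100, the bars cannot be read as slices of one whole, which is the point of part (b).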
1.32 Returns on common stocks. The return on a stock is the change in its market price plus
any dividend payments made. Total return is usually expressed as a percent of the beginning
price. Figure 1.16 is a histogram of the distribution of the monthly returns for all stocks listed on
U.S. markets from January 1985 to November 2010 (311 months). The extreme low outlier is the
market crash of October 1987, when stocks lost 23% of their value in one month. The other two
low outliers are 16% during August 1998, a month when the Dow Jones Industrial Average
experienced its second largest drop in history to that time, and the financial crisis in October
2008 when stocks lost 17% of their value.
FIGURE 1.16
The distribution of monthly percent returns on U.S. common stocks from January 1985 to
November 2010, for Exercise 1.32.
(a) Ignoring the outliers, describe the overall shape of the distribution of monthly returns.
(b) What is the approximate center of this distribution? (For now, take the center to be the value
with roughly half the months having lower returns and half having higher returns.)
(c) Approximately what were the smallest and largest monthly returns, leaving out the outliers?
(This is one way to describe the spread of the distribution.)
(d) A return less than zero means that stocks lost value in that month. About what percent of all
months had returns less than zero?
Answer
(a) The distribution is slightly skewed to the left.
(b) The centre of the distribution is between 0 and 2.5 percent.
(c) The smallest value (ignoring the low outliers) is between -12.5 and -10 percent; the largest
value is between 12.5 and 15 percent.
(d) About 130 of the 311 months, or about 42 percent, had negative returns (although your
estimate could differ).
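The percentage in part (d) is a simple proportion. A minimal check in R, assuming the rough count of 130 months read off the histogram (the variable names are illustrative):

```r
negative_months <- 130  # rough count read off the histogram (an estimate)
total_months <- 311     # January 1985 to November 2010

# Share of months with a negative return, as a percent
pct_negative <- 100 * negative_months / total_months
round(pct_negative)
```

A different reading of the histogram bars would shift the count, and the percentage with it.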
1.36 Carbon dioxide emissions. Burning fuels in power plants and motor vehicles emits carbon
dioxide (CO2), which contributes to global warming. Table 1.6 (on the following page) displays
the 2007 CO2 emissions per person from countries with populations of at least 30 million in that
year.
(a) Why do you think we choose to measure emissions per person rather than total CO2 emissions
for each country?
(b) Make a stemplot to display the data of Table 1.6. The data will first need to be rounded.
What units are you going to use for the stems? The leaves? You should round the data to the
units you are planning to use for the leaves before drawing the stemplot. Describe the shape,
center, and spread of the distribution. Which countries are outliers?
Answer
(a) Emissions are measured per person because, other things being equal, more populous countries
would naturally be expected to have higher total emissions.
(b) Read the data with these R commands:
# Load Table 1.6 (CO2 emissions per person) from the course site
data <- read.csv("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/tbl-1.6-co2emissions.csv", header = TRUE)
names(data)
head(data)
      Country     CO2
1 Afghanistan  0.0272
2     Algeria  4.1384
3   Argentina  4.6525
4  Bangladesh  0.2773
5      Brazil  1.9373
6      Canada 16.9171
# Round to one decimal place so the leaves are tenths of a ton, then draw the stemplot
Co2 <- data[, 2]
Co2 <- round(Co2, 1)
stem(Co2)
  The decimal point is at the |

   0 | 001113333689344589
   2 | 3
   4 | 011479
   6 | 0897
   8 | 238968
  10 | 58
  12 |
  14 |
  16 | 9
  18 | 9
The distribution is positively skewed (skewed to the right), with centre around 4 tons per person
and spread (excluding the two high outliers) from 0 to 10 tons per person. Canada and the US are
outliers. (Your stemplot could differ.)
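The rounding step can be illustrated on the six rows printed by head(data). This is only a sketch: the names `co2_first6`, `stems`, and `leaves` are illustrative, and it assumes whole tons as stems and tenths of a ton as leaves. The stem() display above appears to label only every second stem, so each printed line there covers two of these stems.

```r
# First six rows of Table 1.6, as printed by head(data)
co2_first6 <- c(Afghanistan = 0.0272, Algeria = 4.1384, Argentina = 4.6525,
                Bangladesh = 0.2773, Brazil = 1.9373, Canada = 16.9171)

# Round to the leaf unit (one decimal place) before splitting
rounded <- round(co2_first6, 1)

# Stem = whole tons, leaf = tenths of a ton
stems <- floor(rounded)
leaves <- round(10 * (rounded - stems))

data.frame(rounded, stem = stems, leaf = leaves)
```

Canada, for example, rounds to 16.9 and so contributes the leaf 9 on the line labelled 16.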
1.38 Do women study more than men? We asked the students in a large first-year college class
how many minutes they studied on a typical weeknight. Here are the responses of random
samples of 30 women and 30 men from the class:
(a) Examine the data. Why are you not surprised that most responses are multiples of 10
minutes? What is the other common multiple found in the data? We eliminated one student
who claimed to study 10,000 minutes per night. Are there any other responses you consider
suspicious?
(b) Make a back-to-back stemplot to compare the two samples. That is, use one set of stems
with two sets of leaves, one to the right and one to the left of the stems. (Draw a line on either
side of the stems to separate stems and leaves.) Order both sets of leaves from smallest at the
stem to largest away from the stem. Report the approximate midpoints of both groups. Does
it appear that women study more than men (or at least claim that they do)?
Answer
(a) Students tend to round off their estimates to multiples of 10 minutes; there is further
evidence of rounding to half hours (multiples of 30 minutes) and hours. The 0 seems suspicious
(or perhaps not!).
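Read the data in R:
# Load the study-time data for Exercise 1.38
data <- read.table("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/xrs-1.38-data.txt", header = TRUE)
names(data)
study <- data[, 2]
# The first 30 values are the women, the next 30 the men
women <- study[1:30]
men <- study[31:60]
# Back-to-back stemplot from the aplpack package
library(aplpack)
stem.leaf.backback(women, men, unit = 10, m = 2)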
(b) The back-to-back stemplot appears below, with women on the left and men on the right; I used
10s for the leaf digits and split each stem into two parts. Both distributions are slightly skewed
to the right. The centre of the distribution for men, about 120 minutes, is less than the centre
for women, about 180 minutes, so women do claim to study more than men. (Your stemplot could
differ.)
women:
60 90 115 120 120 120 120 120 120 120 150 150 150 150 170
180 180 180 180 180 180 180 180 180 200 240 240 240 270 360
men:
0 30 30 30 30 45 60 60 60 75 90 90 90 95 120
120 120 120 120 120 150 150 150 180 200 200 230 240 240 300
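The approximate midpoints can be checked against the sample medians. The sketch below retypes the two samples listed above rather than downloading the file, so it runs without network access.

```r
# Study times (minutes) retyped from the listing above
women <- c(60, 90, 115, rep(120, 7), rep(150, 4), 170, rep(180, 9),
           200, rep(240, 3), 270, 360)
men <- c(0, rep(30, 4), 45, rep(60, 3), 75, rep(90, 3), 95, rep(120, 6),
         rep(150, 3), 180, rep(200, 2), 230, rep(240, 2), 300)

# Both samples should contain 30 responses
stopifnot(length(women) == 30, length(men) == 30)

median(women)  # 175
median(men)    # 120
```

With 30 observations the median is the average of the 15th and 16th ordered values, which is why the women's median falls between 170 and 180.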
stem(women)

  The decimal point is 2 digit(s) to the right of the |

  0 | 69
  1 | 22222222
  1 | 55557888888888
  2 | 0444
  2 | 7
  3 |
  3 | 6

stem(men)

  The decimal point is 2 digit(s) to the right of the |

  0 | 03333
  0 | 56668999
  1 | 0222222
  1 | 5558
  2 | 00344
  2 |
  3 | 0
stem.leaf.backback(women,men,unit=10,m=2)
_________________________________________________
  1 | 2: represents 120, leaf unit: 10
              women          men
_________________________________________________
                       | 0* |033334     6
   2                 96| 0. |66679999   14
  10           22222221| 1* |222222     (6)
 (14)    88888888875555| 1. |5558       10
   6               4440| 2* |00344      6
   2                  7| 2. |
                       | 3* |0          1
_________________________________________________
HI: 360
n:                   30      30
_________________________________________________