Course outline
histogram.txt / barplot.txt / piechart.txt / stem-and-leaf.txt / scatter-and-bar.txt / ch1-image
Kerns chapter on types of data and basic R commands

Ch1 exercises: 1.23, 1.29, 1.32, 1.36, 1.38

1.23 Medical students. Students who have finished medical school are assigned to residencies in hospitals to receive further training in a medical specialty. Here is part of a hypothetical database of students seeking residency positions. USMLE is the student’s score on Step 1 of the national medical licensing examination.

(a) What individuals does this data set describe?
(b) In addition to the student’s name, how many variables does the data set contain? Which of these variables are categorical and which are quantitative?

Answer:
(a) The individuals are the students: Laurie Abrams, Gordon Brown, Maria Cabrera, and Miranda Ismael.
(b) There are five variables (excluding Name): Medical school, Sex, and Specialty sought are categorical; Age and USMLE are quantitative.

1.29 Canadian students rate their universities. The National Survey of Student Engagement asked students at many universities, “How would you evaluate your entire educational experience at this university?” Here are the percents of senior-year students at Canada’s 10 largest primarily English-speaking universities who responded “Excellent”:

(a) The list is arranged in order of undergraduate enrollment. Make a bar graph with the bars in order of student rating.
(b) Explain carefully why it is not correct to make a pie chart of these data.

Answer:
(a) Here is the barplot in descending order of student rating.
(b) Each percentage is computed separately for one university, so the ten percentages are not parts of a single whole and do not add up to 100%.

1.32 Returns on common stocks. The return on a stock is the change in its market price plus any dividend payments made. Total return is usually expressed as a percent of the beginning price. Figure 1.16 is a histogram of the distribution of the monthly returns for all stocks listed on U.S.
markets from January 1985 to November 2010 (311 months). The extreme low outlier is the market crash of October 1987, when stocks lost 23% of their value in one month. The other two low outliers are 16% during August 1998, a month when the Dow Jones Industrial Average experienced its second largest drop in history to that time, and the financial crisis in October 2008 when stocks lost 17% of their value.

FIGURE 1.16 The distribution of monthly percent returns on U.S. common stocks from January 1985 to November 2010, for Exercise 1.32.

(a) Ignoring the outliers, describe the overall shape of the distribution of monthly returns.
(b) What is the approximate center of this distribution? (For now, take the center to be the value with roughly half the months having lower returns and half having higher returns.)
(c) Approximately what were the smallest and largest monthly returns, leaving out the outliers? (This is one way to describe the spread of the distribution.)
(d) A return less than zero means that stocks lost value in that month. About what percent of all months had returns less than zero?

Answer:
(a) The distribution is slightly skewed to the left.
(b) The centre of the distribution lies between 0 and 2.5 percent.
(c) The smallest value (ignoring the low outliers) is between -12.5 and -10 percent; the largest value is between 12.5 and 15 percent.
(d) About 130 of the 311 months (although your estimate could differ), or about 42 percent of the months, had negative returns.

1.36 Carbon dioxide emissions. Burning fuels in power plants and motor vehicles emits carbon dioxide (CO2), which contributes to global warming. Table 1.6 (on the following page) displays the 2007 CO2 emissions per person from countries with populations of at least 30 million in that year.

(a) Why do you think we choose to measure emissions per person rather than total CO2 emissions for each country?
(b) Make a stemplot to display the data of Table 1.6. The data will first need to be rounded.
What units are you going to use for the stems? The leaves? You should round the data to the units you are planning to use for the leaves before drawing the stemplot. Describe the shape, center, and spread of the distribution. Which countries are outliers?

Answer:
(a) Emissions are measured per person because, other things equal, more populous countries would naturally be expected to have higher total emissions.
(b) Read the data into R with these commands:

    data <- read.csv("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/tbl-1.6-co2emissions.csv", header = TRUE)
    names(data)
    head(data)

          Country     CO2
    1 Afghanistan  0.0272
    2     Algeria  4.1384
    3   Argentina  4.6525
    4  Bangladesh  0.2773
    5      Brazil  1.9373
    6      Canada 16.9171

    Co2 <- data[, 2]       # second column: CO2 emissions per person
    Co2 <- round(Co2, 1)   # round to one decimal place before drawing the stemplot
    stem(Co2)

      The decimal point is at the |

       0 | 001113333689344589
       2 | 3
       4 | 011479
       6 | 0897
       8 | 238968
      10 | 58
      12 |
      14 |
      16 | 9
      18 | 9

The distribution is positively skewed, with center around 4 tons per person, and spread (excluding the two high outliers) from 0 tons per person to 10 tons per person. Canada and the US are outliers. (Your stem plot could differ.)

1.38 Do women study more than men? We asked the students in a large first-year college class how many minutes they studied on a typical weeknight. Here are the responses of random samples of 30 women and 30 men from the class:

(a) Examine the data. Why are you not surprised that most responses are multiples of 10 minutes? What is the other common multiple found in the data? We eliminated one student who claimed to study 10,000 minutes per night. Are there any other responses you consider suspicious?
(b) Make a back-to-back stemplot to compare the two samples. That is, use one set of stems with two sets of leaves, one to the right and one to the left of the stems. (Draw a line on either side of the stems to separate stems and leaves.) Order both sets of leaves from smallest at the stem to largest away from the stem. Report the approximate midpoints of both groups.
Does it appear that women study more than men (or at least claim that they do)?

Answer:
(a) Students tend to round their estimates, so multiples of 10 minutes are no surprise; there is further evidence of rounding to half hours and hours (multiples of 30 minutes). The response of 0 minutes seems suspicious (or perhaps not!).

Read the data into R:

    data <- read.table("http://www.yorku.ca/nuri/econ2500/moorebps6e/data/xrs-1.38-data.txt", header = TRUE)
    names(data)
    study <- data[, 2]     # second column: minutes studied
    women <- study[1:30]   # rows 1 to 30
    men <- study[31:60]    # rows 31 to 60
    library(aplpack)       # provides stem.leaf.backback()
    stem.leaf.backback(women, men, unit = 10, m = 2)

(b) The back-to-back stemplot appears below, with women on the left and men on the right; the leaf unit is 10 minutes and each stem is split into two parts. Both distributions are slightly skewed to the right. The centre of the distribution for men, about 120 minutes, is less than the centre for women, about 180 minutes, so women do claim to study more than men. (Your stem plot could differ.)

women: 60 90 115 120 120 120 120 120 120 120 150 150 150 150 170 180 180 180 180 180 180 180 180 180 200 240 240 240 270 360

men: 0 30 30 30 30 45 60 60 60 75 90 90 90 95 120 120 120 120 120 120 150 150 150 180 200 200 230 240 240 300

    stem(women)

      The decimal point is 2 digit(s) to the right of the |

      0 | 69
      1 | 22222222
      1 | 55557888888888
      2 | 0444
      2 | 7
      3 |
      3 | 6

    stem(men)

      The decimal point is 2 digit(s) to the right of the |

      0 | 03333
      0 | 56668999
      1 | 0222222
      1 | 5558
      2 | 00344
      2 |
      3 | 0

    stem.leaf.backback(women, men, unit = 10, m = 2)

    _________________________________________________
      1 | 2: represents 120, leaf unit: 10
                      women      men
    _________________________________________________
                           | 0* |033334        6
       2                 96| 0. |66679999     14
      10           22222221| 1* |222222       (6)
     (14)    88888888875555| 1. |5558         10
       6               4440| 2* |00344         6
       2                  7| 2. |
                           | 3* |0             1
    _________________________________________________
    HI: 360
    n:                   30      30
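The approximate midpoints reported in the answer to 1.38(b) can be checked numerically. The following is a minimal sketch, not part of the original solution: it assumes the 60 responses are typed in by hand from the listing above instead of being downloaded, and it uses only base R functions (median() and summary()).

```r
# Minutes studied on a typical weeknight, typed in from the listing above
# (an assumption of this sketch; the original answer reads them from a file)
women <- c(60, 90, 115, 120, 120, 120, 120, 120, 120, 120,
           150, 150, 150, 150, 170, 180, 180, 180, 180, 180,
           180, 180, 180, 180, 200, 240, 240, 240, 270, 360)
men <- c(0, 30, 30, 30, 30, 45, 60, 60, 60, 75,
         90, 90, 90, 95, 120, 120, 120, 120, 120, 120,
         150, 150, 150, 180, 200, 200, 230, 240, 240, 300)

# Each sample should contain 30 students
stopifnot(length(women) == 30, length(men) == 30)

# The median is the midpoint: half the responses lie below it, half above
median(women)   # 175
median(men)     # 120

# Five-number summaries show centre and spread together
summary(women)
summary(men)
```

The women's median of 175 minutes agrees with the "about 180" read off the stemplot, and the men's median is 120 minutes, so the numerical check supports the same conclusion as the plot.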