P a g e |1 CHAPTER 1 Descriptive Statistics 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 Introduction Basic concepts Sampling schemes Graphical representation of data Numerical description of data Computers and statistics Chapter summary Computer examples Projects for Chapter 1 Statistical software R is used for this book. All outputs and codes given are in R. R is a free statistical software, and it can be downloaded from the website: http://www.r-project.org P a g e |2 Exercises 1.2 1.2.1 The suggested solutions: For qualitative data we can have color, sex, race, Zip code and so on. For quantitative data we can have age, temperature, time, height, weight and so on. For cross section data we can have school funding for each department in 2000. For time series data we can have the crude oil price from 1995 to 2008. 1.2.3 The suggested questions can be: 1. What types of data the amounts are? 2. Do these Federal Agency receive the same amount of funding? If not, why? 3. Which Federal Agency should receive more funding? Why? The suggested inferences we can make are: 1. These Federal Agency get different amount of money. 2. There are big differences between funding the Agencies receive. Exercises 1.3 1.3.1 Simple Random Sample: Say we have a population of 1,000 students, and we want a sample of 100 students. Using software or a random table, we randomly select 100 out of the 1,000 students. We want the selection probability for all the students to be equal. That is no student is more likely to be selected than any other student. Systematic Sample: Again, we have a population of 1,000 students, and we want a sample of 100 students. We need the sampling interval k = N/n = 10. Now, we need a random starting point between 1 and k. Let say, we randomly select 4. This gives us the sample: 4, 14, 24, ..., 74, 84, 94. This sample of numbers will correspond to ordered list of students. Stratified Sample: Suppose we decide to sample 100 college students from the population of 1000 ( that is 10% of the population). We know these 1000 students come from three different major, Math, Computer Science and Social Science. We have Math 200, CS 400 and SS 400 students. Then we choose 10% of each of them Math 20, CS 40 and SS 40 by using simple random sample within each major. Cluster Sample: Presume we have a population of 1,000 students clustered into 10 departments. For our sample of students, we will randomly select a subset from the 10 departments. Let say we P a g e |3 randomly select 3 out 10 departments. Now, all the students on those 3 department become the sample from the population of students. Exercises 1.4 1.4.1 (a) Bar graph Bar graph for the percent of road mileage 35.00% 30.00% 25.00% C2 20.00% 15.00% 10.00% 5.00% 0.00% Poor Mediocre Fair C1 Good Very good Pie chart of the percent of road mileage C ategory Poor Very good Good Fair Mediocre (b) Pie chart 1.4.3 (a) Bar graph Bar graph 40.00% C2 30.00% 20.00% 10.00% 0.00% Coal Natural Gas Nyclear Electric Power C1 Petrolium Renewable Energy P a g e |4 (b) Pareto graph 100 0.8 80 0.6 60 0.4 40 0.2 20 0.0 C1 Pe Percentage Percent Cum % m li u tro ra tu Na 0.40 40.0 40.0 as lG al Co ec El ar 0.23 23.0 63.0 e cl Ny 0.22 22.0 85.0 ic tr w Po er R ew en le ab 0.08 8.0 93.0 Percent Percentage Pareto graph 1.0 0 gy er En 0.07 7.0 100.0 Pie chart of species species Category Coal Natural Gas Ny clear Electric Power Petrolium Renewable Energy (c) Pie chart 1.4.5 bar graph 6 5 Count 4 3 2 1 0 A B C C1 (a) Bar graph D F Pie chart species Category A B C D F (b) Pie chart P a g e |5 1.4.7 (a) Pie chart Pie chart species Category Mining Construction Manufacturing Transportation Wholesale Retail Finance Services (b) Bar graph Bar graph 8000 7000 6000 C2 5000 4000 3000 2000 1000 0 g in in M n g n io in tio ct ur rta ru ct st fa po n u s Co an an M Tr W le sa le ho Re il ta e nc na Fi Se ic e rv C1 1.4.9 Bar graph Bar graph 80 70 60 C2 50 40 30 20 10 0 1900 1960 1980 C1 1990 2000 s P a g e |6 1.4.11 (a) Bar graph Bar graph 300 250 C2 200 150 100 50 0 Accidents Chronic C Cancer Diabetes C1 Heart Kidney Pneumonia Strok e Suicide (b) Pareto graph Pareto graph 700 100 600 80 60 400 300 40 200 20 100 0 C1 Percentage Percent Cum % a He rt C 268.0 38.7 38.7 r ce an 199.4 28.8 67.6 ke ro St ia on C 58.5 8.5 76.0 s es nt et de ab ci Di Ac m eu Pn 42.3 35.1 6.1 5.1 82.1 87.2 34.5 5.0 92.2 23.9 3.5 95.6 r he Ot 30.2 4.4 100.0 1.4.13 Histogram 9 8 7 Frequency 6 5 4 3 2 1 0 60 70 80 C1 90 0 Percent Percentage 500 P a g e |7 1.4.15 ( a ) Stem and leaf Stem-and-leaf of SAT Mathematics scores N = 20 Leaf Unit = 10 1 4 7 3 4 99 8 5 00011 10 5 22 10 5 4455 6 5 6667 2 5 9 1 6 0 Histogram 5 Frequency 4 3 2 1 0 480 500 520 (b) Histogram 540 C1 560 580 Pie Chart Category 470-490 490-510 510-530 530-550 550-570 570-590 590-610 (c) Pie chart 1.4.17 Pie Chart Category White or European American Black or African American Asian American American Indian or Alaska Native Native Hawaiian or other Pacific Islander Some other race Two or more races Not Hispanic nor Latino Non-Hispanic White or European American Non-Hispanic Black or African American Non-Hispanic Asian Non-Hispanic American Indian or Alaska Native Non-Hispanic Native Hawaiian or other Pacific Islander Non-Hispanic Some Other Race Non-Hispanic Two or more races Hispanic or Latino White or European American Hispanic Black or African American Hispanic American Indian or Alaska Native Hispanic Asian Hispanic Some Other Race Hispanic Two or more races Hispanic 600 P a g e |8 Exercises 1.5 1.5.1 1 7 6 1 0 5 . . . 7 89 6 x 1 6 5 . 6 7 1 2 1 7 6 1 6 5 . 6 7 1 0 5 1 6 5 . 6 7 . . . 7 8 1 6 5 . 6 7 9 6 1 6 5 . 6 7 s 2 2 2 2 2 1 2 1 s 3 9 8 8 . 4 2 2 s 3 9 8 8 . 4 26 3 . 1 5 1.5.3 Given information: mean=6 , median = 4 , mode = 3 We know that the value 3 can only be in the data twice. If not the median would be different than 4. This give us the following: 3, 3, x, y. Where x and y are the missing values. We introduce a system of equation to solve for x and y. 39xy 6 4 xy2 46 xy1 8 y1 85 y1 3 ,x = 5 x3 4 2 x83 x5 Data: 3, 3, 5, 13 1 2 2 2 2 Var 36 36 56 136 3 1 = 68 3 =22.667 Sd= Var = 22.667 =4.76 1.5.5 Q1 80 Q3 115 (a) M 95 IQR 115 80 35 LL 80 1.535 27.5 LL 115 1.535 167.5 P a g e |9 40 60 80 100 120 (b) (c) There are no outliers. 1.5.7 l l l f ( m x ) f ( m ) f ( x ) n x n x 0 (a) i i i i i i 1 i 1 i 1 5 f m x 5 2 1 1 . 8 1 4 7 1 1 . 8 1 5 1 2 1 1 . 8 1 0 1 7 1 1 . 8 6 2 2 1 1 . 8 (b) i i i 1 5 9 . 8 1 4 4 . 8 1 5 . 2 1 0 5 . 2 6 1 0 . 2 4 9 6 7 . 2 3 5 2 6 1 . 2 0 1.5.9 3 2 x i 1 0 5 9 .3 6 xi1 3 3 .1 0 5 3 2 3 2 (a) 132 4 8 8 .3 3 2 2 5 s2 1 7 7 .0 4 3 xi 33.105 3 1i1 3 1 ra n g e5 3 .5 0 5 .3 14 8 .1 9 P a g e | 10 24.75 25.44 25.095 2 42.19 43.25 Q3 42.72 2 32 32 32 (b) M 2 IQR 42.72 25.095 17.625 Q1 LL 25.095 1.5 17.625 1.3425 LL 42.72 1.5 17.625 69.1575 10 20 30 40 50 There are no outliers. (c) 4 0 2 Frequency 6 8 Histogram of y 0 (d) 10 20 30 y 40 50 60 P a g e | 11 (e) x 33.105 x s 1 9 .8 0 ,4 6 .4 1 21 data point (65.625%) fall within 1 SD, empirical rule = 68% x 2 s .4 9 ,5 9 .7 2 6 31 data point (96.875%) fall within 2 SD, empirical rule = 95% x 3 s 6 .8 1 ,7 3 .0 2 32 data point (100%) fall within 3 SD, empirical rule = 99.7% 1.5.11 2 1 1 , 0 0 0 2 1 1 1 , 0 0 0 x 1 1 0 , s 1 , 9 0 0 , 0 0 0 6 9 6 9 6 9 7 1 0 0 1 0 0 1 1 0 0 (a) s 6 9 6 9 . 6 9 7 8 3 . 4 8 4 7 xs . 6 8 ( 4 0 0 ,0 0 0 ) 2 7 2 ,0 0 0 2 s . 9 5 ( 4 0 0 ,0 0 0 ) 3 8 0 ,0 0 0 (b) x x 3 s . 9 9 7 ( 4 0 0 ,0 0 0 ) 3 9 8 ,8 0 0 1.5.13 (a) x 112.33.7433 x 30 i1 i 30 30 30 1 2 s2 xi 3.7433 3.502 29i1 s 3.5021.871 (b) Frequency table Class Interval 1 0-1.6 2 1.7-3.3 3 3.4-5 4 5.1-6.7 5 6.8-8.4 (c) Grouped data: Frequency 4 10 9 5 2 Mi .8 2.5 4.2 5.9 7.6 Mi∙fi 3.2 25 37.8 29.5 15.2 4 ( . 8 ) 1 0 ( 2 . 5 ) 9 ( 4 . 2 ) 5 ( 5 . 9 ) 2 ( 7 . 6 ) x 3 . 6 9 3 0 2 2 2 2 2 21 s 4 0 . 8 3 . 6 9 1 0 2 . 5 3 . 6 9 9 4 . 2 3 . 6 9 5 5 . 9 3 . 6 9 2 7 . 6 3 . 6 93 . 6 2 2 9 s 3 . 6 2 1 . 9 0 The results from the grouped data are similar to the actual data. P a g e | 12 1.5.15 L 25 fm 139615 w4 Fb 178859 n514661 w M L (. 5 n F ) 27 . 24822 b f m 1.5.17 3 8 1 0 3 1 2 9 . 5 5 9 4 9 . 5 4 5 6 9 . 5 7 8 9 . 5 x 4 4 . 2 7 2 1 8 0 2 2 2 2 2 21 s 3 8 1 0 4 4 . 2 7 2 3 1 2 9 . 5 4 4 . 2 7 2 5 9 4 9 . 5 4 4 . 2 7 2 4 5 6 9 . 5 4 4 . 2 7 2 7 8 9 . 5 4 4 . 2 7 2 4 9 5 3 6 . 1 4 6 s 5 3 6 . 1 4 6 2 3 . 1 5 5 (b) L 40 , f m 59 , w 19 , Fb 69