Statistics: - Texas A&M University

STAT 211
Handout 1 (Chapter 1)
Overview and Descriptive Statistics
Statistics:
The branch of scientific inquiry that provides methods for organizing and
summarizing data and for using information in the data to draw various
conclusions.
Population: All individuals or objects of a particular type. We will denote the
population size by N.
Sample:
A portion or subset of the population. We will denote the sample size by n.
Variable:
Any characteristic whose value may change from one object to another in
the population.
Example 1:
Population: engineering students at Texas A&M University
Sample: graduating engineering students at Texas A&M University
Variable: gender of graduating engineering students at Texas A&M
University, or GPA of graduating engineering students at Texas A&M
University
Question:
Is the sample or the variable unique for the same population?
Question:
Population: all daily newspapers published in the United States
Sample:
Variable:
Question:
Published papers propose that consumption of vitamin A prevents cancer.
How would you determine whether or not the data support their proposal?
Question:
Studies show that smoking cigarettes causes cancer and yellow fingers.
How would you determine whether the data support this?
Descriptive statistics: Organizing and summarizing the data.
Inferential statistics: Drawing conclusions about the population based on sample
information.
DATA
- Univariate
  - Qualitative: categorical
  - Quantitative: numerical
    - Discrete
    - Continuous
Example 2: Identify each of the following as categorical or numeric (if numeric, discrete or
continuous): color of eyes, number of students who play baseball in different schools, price of
your textbook, type of car each student drives, your height (in inches) or weight (in
pounds), zip code, actual weight of tea leaves in a 1-lb package, number of customers
waiting in different banks.
- Bivariate: two groups
- Multivariate: more than two groups
Tabular and Pictorial Methods for Describing Data
Given a data set of n observations on some variable X, the individual observations are
$x_1, x_2, \ldots, x_n$. The ordered observations (if numeric, from smallest to largest) are
written $x_{(i)}$, $i = 1, 2, \ldots, n$, where $x_{(i)}$ is the ith ordered value. n is the
sample size and N is the population size.
Stem–and–Leaf Display
Stem-and-leaf plots are an easy way to display and examine numeric data. An advantage of
this type of plot is that you can still see the individual data values. How to make a stem-and-leaf:
1. Look at the range of your data.
2. Choose your stem: this is the leading digit(s), usually the 1's, 10's, 100's, etc. place.
3. Add your leaf: this is the trailing digit(s). Some displays plot only the next digit,
while others plot the next few digits.
Example 3 (Exercise 1.14): The data set consists of observations on shower-flow rate, X
(L/min), for 129 houses in Perth, Australia. The unordered data (x1=4.6, x2=12.3, x3=7.1,
..., x127=6.3, x128=3.8, x129=6.0) are listed in the textbook. The data range from 2.2 to 18.9.
Thus I will use the whole-number part (the 1's and 10's places) as my stem and attach the
leaf, which is the next digit (the tenths).
The following Minitab output summarizes the final result for the complete data:
Stem-and-leaf of Rate    n = 129
Leaf Unit = 0.10

    2    2 23
   12    3 2344567789
   20    4 01356889
   37    5 00001114455666789
   62    6 0000122223344456667789999
  (17)   7 00012233455555668
   50    8 02233448
   42    9 012233335666788
   27   10 2344455688
   17   11 2335999
   10   12 37
    8   13 8
    7   14 36
    5   15 0035
    1   16
    1   17
    1   18 9
For the 6 data points (4.6, 12.3, 7.1, 6.3, 3.8, 6.0) selected from the 129 observations,
the stem-and-leaf display is tabulated as follows:
 3 | 8
 4 | 6
 5 |
 6 | 03
 7 | 1
 8 |
 9 |
10 |
11 |
12 | 3
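The three steps above can be sketched in a few lines of Python. This is only an illustration, not the procedure Minitab uses: the function name `stem_and_leaf` and its `leaf_unit` parameter are made up for this handout, and the sketch assumes non-negative data with a single leaf digit.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=0.1):
    """Return display lines, one per stem (hypothetical helper, non-negative data only)."""
    leaves = defaultdict(list)
    for v in values:
        scaled = round(v / leaf_unit)      # 12.3 becomes 123 when the leaf unit is 0.1
        stem, leaf = divmod(scaled, 10)    # 123 splits into stem 12 and leaf 3
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem between the smallest and largest so that gaps stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in sorted(leaves.get(stem, [])))
        lines.append(f"{stem:>3} | {digits}")
    return lines

six_points = [4.6, 12.3, 7.1, 6.3, 3.8, 6.0]
for line in stem_and_leaf(six_points):
    print(line)
```

Empty stems (5, 8, 9, 10, 11) are printed on purpose, since a gap in the display is part of what the plot is meant to show.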
Example 4 (Exercise 9.23): Fusible interlinings are being used with increasing frequency
to support outer fabrics and improve the shape and drape of various pieces of clothing.
The data on extensibility (%) at 100 gm/cm for both high-quality fabric (H) and
poor-quality fabric (P) specimens are as follows.
H: 1.2 0.9 0.7 1.0 1.7 1.7 1.1 0.9 1.7 1.9 1.3 2.1 1.6 1.8 1.4 1.3 1.9 1.6 0.8 2.0 1.7 1.6 2.3 2.0
P: 1.6 1.5 1.1 2.1 1.5 1.3 1.0 2.6
The following Minitab output summarizes the final result for the complete data using
different appearances of stem-and-leaf displays.
Stem-and-leaf of H:    n = 24
Leaf Unit = 0.10

    1    0 7
    4    0 899
    6    1 01
    9    1 233
   10    1 4
   (7)   1 6667777
    7    1 899
    4    2 001
    1    2 3

Stem-and-leaf of P:    n = 8
Leaf Unit = 0.10

    2    1 01
    3    1 3
   (2)   1 55
    3    1 6
    2    1
    2    2 1
    1    2
    1    2
    1    2 6

Stem-and-leaf of H:    n = 24
Leaf Unit = 0.10

    4    0 7899
   10    1 012334
  (10)   1 6667777899
    4    2 0013

Stem-and-leaf of P:    n = 8
Leaf Unit = 0.10

    3    1 013
   (3)   1 556
    2    2 1
    1    2 6
Frequency Distributions
For numeric data, a frequency distribution creates class intervals (for continuous data) or
lists the distinct data values (for discrete data) and counts the number of observations that
fall into each. This count is called the frequency. Relative frequencies are obtained by
dividing each frequency by the total number of observations; a relative frequency is the
fraction or proportion of the time the interval is observed (the value occurs). For
categorical data, the frequency is the number of observations that fall into each category.
Example 5 (Exercise 1.21): The number of intersections, Z, is listed as one of the
characteristics of subdivisions.

z    Count    Relative frequency    Cumulative relative frequency
0    13       13/47=0.2766          13/47=0.2766
1    11       11/47=0.2340          24/47=0.5106
2     3        3/47=0.0638          27/47=0.5745
3     7        7/47=0.1489          34/47=0.7234
4     5        5/47=0.1064          39/47=0.8298
5     3        3/47=0.0638          42/47=0.8936
6     3        3/47=0.0638          45/47=0.9574
7     0        0/47=0               45/47=0.9574
8     2        2/47=0.0426          47/47=1
n=47
What percentage of these subdivisions had at most 3 intersections?
What percentage of these subdivisions had fewer than 3 intersections?
What percentage of these subdivisions had between 2 and 5 (inclusive) intersections?
What percentage of these subdivisions had fewer than 2 or more than 5 intersections?
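Questions of this kind reduce to adding up counts and dividing by n. The sketch below rebuilds the relative and cumulative relative frequency columns from the counts in Example 5; the helper name `percent_between` is hypothetical and is only introduced for illustration.

```python
# Counts of subdivisions by number of intersections z, taken from Example 5.
counts = {0: 13, 1: 11, 2: 3, 3: 7, 4: 5, 5: 3, 6: 3, 7: 0, 8: 2}
n = sum(counts.values())

def percent_between(lo, hi):
    """Percentage of subdivisions with lo <= z <= hi (both ends inclusive)."""
    hits = sum(c for z, c in counts.items() if lo <= z <= hi)
    return 100 * hits / n

running = 0
for z, c in counts.items():
    running += c  # cumulative count up to and including this z
    print(f"{z}  {c:2d}  {c / n:.4f}  {running / n:.4f}")

# "At most 3" means z = 0, 1, 2 or 3, i.e. the cumulative relative frequency at z = 3.
print(f"at most 3: {percent_between(0, 3):.2f}%")
```

"Fewer than 3" would use `percent_between(0, 2)` instead, which is why the wording of each question matters for a discrete variable.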
Histogram
A pictorial representation of a frequency distribution can be obtained by constructing a
histogram. The histogram is a much better way of visualizing a data set than the
stem-and-leaf. The following is the histogram for Example 5.
[Figure: histogram of z (0 to 8 on the horizontal axis) with Frequency on the vertical axis.]
How to construct a histogram for continuous data:
a) Divide the range of observations into intervals (plotted on the x axis).
b) Count the number of observations that fall in each interval: the frequency.
c) Compute the relative frequency = (# in interval) / (size of data, n), the fraction
(or, times 100, the percentage) of the data that falls into the interval.
d) Plot a rectangle above each interval whose height is proportional to the relative
frequency or frequency. If the intervals for the continuous data do not all have the
same width, density (relative frequency / interval width) is a better measure to use for
the histogram.
The following is the histogram and the frequency distribution for Example 3 (flow rate).
[Figure: histogram of Rate (0 to 20 on the horizontal axis) with Frequency (0 to 40) on the vertical axis.]

rate       Count    Relative frequency    Cumulative relative frequency    Density
[1,3)       2        2/129                  2/129 ≈ 0.0155                  1/129
[3,5)      18       18/129                 20/129 ≈ 0.1550                  9/129
[5,7)      42       42/129                 62/129 ≈ 0.4806                 21/129
[7,9)      25       25/129                 87/129 ≈ 0.6744                 12.5/129
[9,11)     25       25/129                112/129 ≈ 0.8682                 12.5/129
[11,13)     9        9/129                121/129 ≈ 0.9380                  4.5/129
[13,15)     3        3/129                124/129 ≈ 0.9612                  1.5/129
[15,17)     4        4/129                128/129 ≈ 0.9922                  2/129
[17,19)     1        1/129                129/129 = 1                       0.5/129

Rule of thumb: number of classes ≈ $\sqrt{\text{number of observations}}$
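The columns of the flow-rate table follow mechanically from the class boundaries and counts. As a rough sketch (the variable names are illustrative only), the relative frequency, cumulative relative frequency and density can be computed as below, together with the square-root rule of thumb for the number of classes.

```python
import math

edges = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]   # class boundaries; each class is [left, right)
counts = [2, 18, 42, 25, 25, 9, 3, 4, 1]      # counts from the flow-rate table
n = sum(counts)

rows = []
cumulative = 0.0
for left, right, count in zip(edges, edges[1:], counts):
    rel = count / n                    # relative frequency
    cumulative += rel                  # cumulative relative frequency
    density = rel / (right - left)     # relative frequency per unit of interval width
    rows.append((left, right, rel, cumulative, density))
    print(f"[{left},{right})  {rel:.4f}  {cumulative:.4f}  {density:.4f}")

# Rule of thumb: about sqrt(n) classes. It is a starting point, not a requirement.
suggested_classes = round(math.sqrt(n))
print(suggested_classes)
```

When density is used for the bar heights, the bar areas (density times width) add up to 1, which is what makes histograms with unequal class widths comparable.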
What to Look For In Your Graph: (Use with stem-and-leaf & histogram)
1. The center of the distribution.
2. The overall shape of the distribution.
   Unimodal
   - Symmetric: portions on each side of the center value are mirror images of
     each other
   - Skewed left (negatively skewed): the left tail (lower values) is stretched
     out longer than the right tail (higher values)
   - Skewed right (positively skewed): the right tail (higher values) is
     stretched out longer than the left tail (lower values)
   Thus, whichever direction the curve is pulled, that is the direction in which it
   is skewed.
   Bimodal
   Multimodal
3. Marked deviations from the overall shape of the distribution.
   - Outliers: individual observations that fall well outside the overall pattern
     of the graph
   - Gaps in the distribution
For the intersections data, we see that the center of the distribution is around 2.
The graph is skewed to the right with one major distinct peak (unimodal). The
observations at 8 intersections may be outliers. Note the one major gap (at 7).
For the flow rate data, we see that the center of the distribution is around 7.
The graph is skewed to the right with one major distinct peak (unimodal). There are
no major gaps or outliers.
For categorical data (such as the cars students drive), we can count the number of students
in each category of car. Putting the categories on the horizontal axis and the counts on the
vertical axis, and placing a bar as high as the frequency on top of each category, creates
the histogram.
Measures of location:
The sample mean, $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, is the arithmetic average.
The population mean is $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$.
There is only one mean for a quantitative data set. Its value is influenced by
extreme measurements.
Note that the sample mean is a statistic, whereas the population mean is a
parameter.
The sample median, $\tilde{x}$, is the middle value when the measurements are arranged from
lowest to highest. If n is odd, the median is the observation that has exactly (n-1)/2
observations above it and (n-1)/2 observations below it in the ordered list. If n is even, the
median is the average of the two middle values, so that n/2 observations lie above it and n/2
observations lie below it.
There is only one median for a quantitative data set, and its value is unlikely to be
influenced by a few extreme measurements.
The mode is the most frequently occurring value. This measure may not be unique in that
two (or more) values may occur with the same greatest frequency.
There can be more than one mode for a data set. It is applicable to both
quantitative and qualitative data. Its value is unlikely to be influenced by a few extreme
measurements.
As a rule of thumb, the distribution is negatively skewed if mean < median, positively
skewed if median < mean, and symmetric if mean = median.
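As a quick check of these definitions, the three measures can be computed with Python's standard `statistics` module. The data are the eight poor-quality fabric (P) values from Example 4; the variable names are illustrative.

```python
import statistics

# Extensibility (%) of the poor-quality fabric specimens (P) from Example 4.
p_fabric = [1.6, 1.5, 1.1, 2.1, 1.5, 1.3, 1.0, 2.6]

mean = statistics.mean(p_fabric)        # arithmetic average
median = statistics.median(p_fabric)    # n = 8 is even: average of the two middle values
modes = statistics.multimode(p_fabric)  # a list, because the mode need not be unique

print(mean, median, modes)
print("mean > median" if mean > median else "mean <= median")
```

Here the mean exceeds the median, which by the rule of thumb above points to a positively skewed sample; the single large value 2.6 pulls the mean up but leaves the median unchanged.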
Quartiles divide the ordered data set into four equal parts.
Lower Quartile (Q1): The value that separates the smallest 25% of the data from the rest.
It can be computed by finding the median of the smallest n/2 observations if n is even,
and the median of the smallest (n+1)/2 observations if n is odd. Your textbook calls
this the lower fourth.
Upper Quartile (Q3): The value that separates the smallest 75% of the data from the
largest 25%. It can be computed by finding the median of the largest n/2 observations
if n is even, and the median of the largest (n+1)/2 observations if n is odd. Your
textbook calls this the upper fourth.
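The median-of-halves rule for the fourths can be sketched as follows. The function name `fourths` is hypothetical. Note that statistical software may use a different interpolation rule, so its Q1 and Q3 need not agree with the fourths: for the P data of Example 4 this rule gives a lower fourth of 1.2, while the Minitab output later in this handout reports Q1 = 1.150.

```python
import statistics

def fourths(values):
    """Lower and upper fourth: medians of the smallest and largest halves."""
    ordered = sorted(values)
    n = len(ordered)
    # For odd n the middle observation is included in both halves.
    half = n // 2 if n % 2 == 0 else (n + 1) // 2
    lower = statistics.median(ordered[:half])
    upper = statistics.median(ordered[n - half:])
    return lower, upper

p_fabric = [1.6, 1.5, 1.1, 2.1, 1.5, 1.3, 1.0, 2.6]
lower, upper = fourths(p_fabric)
print(lower, upper, upper - lower)   # the last value is the fourth spread
```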
Percentiles divide the data set into 100 equal parts. The pth percentile is the observation
in the data set such that p% of the observations are equal to or less than it. To
calculate the pth percentile, $x_{[p]}$:
- order the data from smallest to largest
- let $i_p = np/100$
- find the smallest integer index $i$ such that $i > i_p$
- the pth percentile is
$$x_{[p]} = \begin{cases} \dfrac{x_{(i-1)} + x_{(i)}}{2}, & \text{if } i - 1 = i_p \\[4pt] x_{(i)}, & \text{otherwise} \end{cases}$$
Trimmed mean is a compromise between the mean and the median. A 5% trimmed
mean would be computed by eliminating the smallest 5% and the largest
5% of the sample and averaging what is left over.
Sample proportion is the number of successes divided by the total number of
observations.
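Two of the location measures above, the pth percentile and the trimmed mean, are easy to get wrong by one index, so a short sketch may help. The function names are hypothetical. The percentile function follows the handout's rule for 0 < p < 100. For the trimmed mean, how many observations to drop when 5% of n is not a whole number is not stated here; rounding to the nearest whole count is an assumption of this sketch.

```python
import math
from fractions import Fraction

def percentile(values, p):
    """p-th percentile by the handout's rule; this sketch assumes 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    i_p = Fraction(n) * Fraction(p) / 100   # exact arithmetic avoids float round-off
    i = math.floor(i_p) + 1                 # smallest integer index with i > i_p
    if i - 1 == i_p:                        # i_p is a whole number
        return (ordered[i - 2] + ordered[i - 1]) / 2   # average of x_(i-1) and x_(i)
    return ordered[i - 1]                   # x_(i), shifted to 0-based indexing

def trimmed_mean(values, percent=5):
    """Mean after dropping the smallest and largest `percent`% of the sample."""
    ordered = sorted(values)
    # Assumption: the number dropped from each end is rounded to the nearest whole count.
    k = math.floor(len(ordered) * percent / 100 + 0.5)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# High-quality fabric (H) extensibility data from Example 4.
h_fabric = [1.2, 0.9, 0.7, 1.0, 1.7, 1.7, 1.1, 0.9, 1.7, 1.9, 1.3, 2.1,
            1.6, 1.8, 1.4, 1.3, 1.9, 1.6, 0.8, 2.0, 1.7, 1.6, 2.3, 2.0]
print(percentile(h_fabric, 50))
print(round(trimmed_mean(h_fabric), 4))
```

With n = 24 and p = 50, i_p = 12 is a whole number, so the 50th percentile is the average of the 12th and 13th ordered values, which is the usual median for even n.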
Measures of variability:
The sample range measures the distance between the largest and smallest observations:
$R = x_{(n)} - x_{(1)}$.
It is sensitive to outliers and provides no information on patterns of variability.
Interquartile range (IQR = Upper Quartile - Lower Quartile = Q3 - Q1): It is the range of
the middle half of the distribution. Your textbook calls it the fourth spread.
It is not sensitive to outliers.
Deviations from the mean: $x_i - \bar{x}$, where $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
The sample variance and standard deviation are
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$
and $s = \sqrt{s^2}$, respectively, for a given sample size n, whereas the population
variance and standard deviation are
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
and $\sigma = \sqrt{\sigma^2}$, respectively, for a given population size N.
It is the most commonly used measure of variability and is sensitive to outliers.
Note that the sample variance and standard deviation are statistics, whereas the
population variance and standard deviation are parameters.
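The two forms of the sample variance (the definition in terms of squared deviations and the computational shortcut) can be compared numerically. The sketch below uses the P data from Example 4; the function names are illustrative.

```python
import math

def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Shortcut: (sum of x^2 minus (sum of x)^2 / n), divided by n - 1."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

p_fabric = [1.6, 1.5, 1.1, 2.1, 1.5, 1.3, 1.0, 2.6]
s2 = sample_variance(p_fabric)
s = math.sqrt(s2)
print(s2, sample_variance_shortcut(p_fabric), s)
```

Both forms agree up to floating-point round-off, and the resulting standard deviation is consistent with the StDev of 0.530 that Minitab reports for P later in this handout. The shortcut form is convenient by hand but can lose precision on a computer when the mean is large relative to the spread.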
Coefficient of variation (CV): A unit-free measure of variation (the amount of variability
relative to the value of the mean), whereas the variance and standard deviation measure
variability in a way that depends on the units of measurement. $CV = 100\,(s/\bar{x})$.
Example 6: When the heights of students (in inches) and their weights (in pounds) are
recorded, the two variables have different units, so the CV is used to decide which data
set has more variation.
The following Minitab output summarizes some of the measures of location and
variability for flow rate data.
Variable      n     Mean   Median   TrMean    StDev  SE Mean
Rate        129    7.708    7.000    7.540    3.077    0.271

Variable   Minimum  Maximum       Q1       Q3
Rate         2.200   18.900    5.600    9.600
The following Minitab output summarizes some of the measures of location and
variability for the extensibility of high-quality versus low-quality fabric.
Variable      n     Mean   Median   TrMean    StDev  SE Mean
H:           24   1.5083   1.6000   1.5091   0.4442   0.0907
P:            8    1.588    1.500    1.588    0.530    0.188

Variable   Minimum  Maximum       Q1       Q3
H:          0.7000   2.3000   1.1250   1.8750
P:           1.000    2.600    1.150    1.975
The following Minitab output summarizes some of the measures of location and
variability for intersections data.
Variable      n     Mean   Median   TrMean    StDev  SE Mean
z:           47    2.277    1.000    2.116    2.253    0.329

Variable   Minimum  Maximum       Q1       Q3
z:           0.000    8.000    0.000    4.000
Example 7:
- Suppose X is a random variable with the values -100, -50, 0, 50, 100. Define some of
the measures of location and variability.
- Suppose X is a random variable with the values -200, -100, -50, 0, 50, 100. Define
some of the measures of location and variability.
- If the sample mean is 50 for 10 observations and the 11th observation is 50, what would
be the sample mean of the 11 observations?
- If the sample mean and variance are 50 and 3.25 for 10 observations and the 11th
observation is 50, what would be the sample variance of the 11 observations?
- If the deviations from the mean for 5 observations are -0.3, 0.1, 2, 1.4, -1.7, what
would be the sum of the remaining 5 deviations from the mean when the data set
has 10 observations?
Question: Let c be a constant, and let X and Y be random variables. How would the mean and
variance change if you
- add the same constant to each observation ($y_i = x_i + c$, i = 1, 2, ..., n)?
- multiply each observation by the same constant ($y_i = c x_i$, i = 1, 2, ..., n)?
Boxplots
Boxplots are formed using what is called the five number summary:
1. minimum
2. first (lower) quartile, 25th percentile, Q1.
3. median, 50th percentile, Q2.
4. third (upper) quartile, 75th percentile, Q3.
5. maximum
Ideal for comparing two populations (samples) when measuring a continuous random
variable.
1. The ends of the box are at the quartiles. The length of the box is Q3-Q1. This box
contains the middle 50% of the data values.
2. The median is marked by a line within the box.
3. The two vertical lines (called whiskers) outside the box extend to the smallest and
largest observations within 1.5 × IQR of the edges of the box.
4. Observations beyond these whiskers (that is, farther than 1.5 × IQR from the
edge of the box) are called outliers. In general, an outlier is an observation that is much
larger or smaller than the rest of the data. An observation that falls between 1.5 IQR and
3 IQR from the edge to which it is closest is called a mild outlier. An observation that
falls more than 3 IQR from the edge to which it is closest is called an extreme outlier.
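Comparative boxplot for the extensibility of high-quality versus low-quality fabric:
[Figure: side-by-side boxplots of H and P; vertical axis Extensibility, 0.5 to 2.5.]
Boxplot for flow rate:
[Figure: boxplot of Rate; axis from 0 to 20.]
Boxplot for intersections:
[Figure: boxplot of z; axis from -2 to 10.]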
For the boxplots above,

        1.5∙IQR                    Q1-1.5∙IQR    Q3+1.5∙IQR
H       1.5(1.875-1.125)=1.125     0             3
P       1.5(1.975-1.15)=1.2375     -0.0875       3.2125
Rate    1.5(9.6-5.6)=6             -0.4          15.6
Z       1.5(4-0)=6                 -6            10
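The 1.5 IQR and 3 IQR rules can be written as a small classifier. This is a sketch only: the function name `classify` and its labels are made up for illustration, and the quartiles passed in are the Minitab values for the flow-rate data.

```python
def classify(value, q1, q3):
    """Label an observation using the 1.5*IQR and 3*IQR rules described above."""
    iqr = q3 - q1
    if q1 - 1.5 * iqr <= value <= q3 + 1.5 * iqr:
        return "not an outlier"
    if q1 - 3 * iqr <= value <= q3 + 3 * iqr:
        return "mild outlier"
    return "extreme outlier"

# Flow-rate data: Q1 = 5.6 and Q3 = 9.6, so the 1.5*IQR limits are about -0.4 and 15.6.
print(classify(18.9, 5.6, 9.6))   # the maximum lies above 15.6 but below 9.6 + 12 = 21.6
print(classify(2.2, 5.6, 9.6))    # the minimum lies inside the limits
```

A negative lower limit such as -0.4 simply means that no observation can be flagged as a low outlier when the variable cannot be negative.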
INTERPRETING BOXPLOTS
- Note the position of the median. Medians not in the middle of the box can indicate
skewness in the middle 50% of the data as well as in the whole data set. Recall that
the mean will get drawn in the direction of the "skewness". Thus the box will be a lot
longer in the direction of the skewness.
- Note the length of the whiskers and the outliers. If the data are symmetric, the
whiskers will be of equal length.
Coefficient of Skewness (SK): The direction of and degree to which a frequency
distribution is skewed. (SK<0 ⇒ negatively skewed, SK=0 ⇒ symmetric, SK>0 ⇒
positively skewed.)
Example 8: A computer scientist is investigating the usefulness of two different design
languages in improving programming tasks. Twelve expert programmers, familiar with
both languages, are asked to code a standard function in both languages, and the time in
minutes is recorded.
Programmer           1  2  3  4  5  6  7  8  9 10 11 12
Design language 1   17 16 21 14 18 24 16 14 21 23 13 18
Design language 2   18 14 19 11 23 21 10 13 19 24 15 20
The following are the descriptive statistics and the comparative boxplots
obtained by MINITAB.
Variable      n     Mean   Median   TrMean    StDev  SE Mean
Design1      12    17.92    17.50    17.80     3.63     1.05
Design2      12    17.25    18.50    17.30     4.59     1.33

Variable   Minimum  Maximum       Q1       Q3
Design1      13.00    24.00    14.50    21.00
Design2      10.00    24.00    13.25    20.75

[Figure: comparative boxplots of DESIGN1 and DESIGN2; vertical axis from 8 to 26.]
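Most of the summary for Example 8 can be recomputed from the raw times with Python's `statistics` module. One caveat: quartile conventions differ between packages. The `method="exclusive"` option of `statistics.quantiles` happens to reproduce the Q1 and Q3 shown in this output, but that it matches Minitab's rule in general is an assumption of this sketch. The helper name `summarize` is hypothetical.

```python
import statistics

design1 = [17, 16, 21, 14, 18, 24, 16, 14, 21, 23, 13, 18]
design2 = [18, 14, 19, 11, 23, 21, 10, 13, 19, 24, 15, 20]

def summarize(values):
    """Mean, median, sample standard deviation and quartiles of one sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),   # divides by n - 1
        "q1": q1,
        "q3": q3,
    }

s1 = summarize(design1)
s2 = summarize(design2)
for name, s in (("Design1", s1), ("Design2", s2)):
    print(name, {key: round(val, 2) for key, val in s.items()})
```

Comparing the two rows side by side shows the pattern the boxplots display: Design 2 has the slightly lower mean but the higher median and the larger spread.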
Example 9 (Exercise 1.60): Observations on burst strength (lb/in^2) were obtained both for
test nozzle closure welds and for production canister nozzle welds.
The following are the descriptive statistics and the comparative boxplots
obtained by MINITAB.
Variable       n   n*     Mean   Median   TrMean    StDev
Test          11    1     7355     7300     7389      614
Cannister     12    0   5887.5   5887.5   5880.0    317.9

Variable   SE Mean  Minimum  Maximum       Q1       Q3
Test           185     6100     8300     7200     8000
Cannister     91.8   5250.0   6600.0   5725.0   6037.5

[Figure: comparative boxplots of Test and Cannister; vertical axis Strength, 5000 to 8000.]