Ch 6B Random Sampling & Data Descriptions

Chapter 6 - Random
Sampling and Data
Description
More joy of dealing with large quantities of data
You can never have
too much data.
Chapter 6B
Today in Prob & Stat
6-2 Stem-and-Leaf Diagrams
Steps for Constructing a Stem-and-Leaf Diagram
6-2 Stem-and-Leaf Diagrams
Example 6-4
Figure 6-4
Stem-and-leaf diagram
for the compressive
strength data in Table
6-2.
Figure 6-5
Stem-and-leaf displays for Example 6-5 (25 observations on batch yields).
Stem: Tens digits. Leaf: Ones digits.
Panel annotations: too few stems, just right, too many stems.
Figure 6-6
Stem-and-leaf
diagram from
Minitab.
Annotation: number of observations in the middle stem.
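The construction of a stem-and-leaf display can be sketched in a few lines of Python. This is a minimal sketch, not Minitab's algorithm: the function names `stem_and_leaf` and `render` are made up for illustration, the data must be non-negative, values are truncated (not rounded) to the leaf unit, and each stem gets a single row rather than being split as in the Minitab display.

```python
from collections import defaultdict


def stem_and_leaf(data, leaf_unit=1):
    """Return {stem: sorted leaves} for non-negative data (illustrative helper)."""
    rows = defaultdict(list)
    for x in data:
        scaled = int(x // leaf_unit)     # drop digits below the leaf unit (truncate)
        stem, leaf = divmod(scaled, 10)  # last kept digit is the leaf, the rest is the stem
        rows[stem].append(leaf)
    # Include empty stems so that gaps in the data stay visible.
    return {s: sorted(rows.get(s, [])) for s in range(min(rows), max(rows) + 1)}


def render(rows):
    """Format the display as text, one stem per line."""
    return "\n".join(
        f"{stem:>3} | {''.join(map(str, leaves))}" for stem, leaves in rows.items()
    )


# Hypothetical yields, used only to show the call.
yields = [12, 15, 21, 27, 27, 34]
print(render(stem_and_leaf(yields)))
```

With `leaf_unit=100`, a value such as 1676.5 is truncated to 16 leaf units and lands on stem 1 with leaf 6.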
6-4 Box Plots
• The box plot is a graphical display that
simultaneously describes several important features of
a data set, such as center, spread, departure from
symmetry, and identification of observations that lie
unusually far from the bulk of the data.
• Whisker
• Outlier
• Extreme outlier
Figure 6-13
Description of a box plot.
Figure 6-14
Box plot for compressive strength data in Table 6-2.
Figure 6-15
Comparative box
plots of a quality
index at three plants.
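The quantities behind a box plot can be computed directly. The sketch below assumes the common convention that whiskers reach the most extreme observations within 1.5 interquartile ranges of the box, that points between 1.5 and 3 interquartile ranges away are outliers, and that points beyond 3 interquartile ranges are extreme outliers; the slides do not state these cutoffs, so treat them as an assumption. The function name and the sample values are hypothetical.

```python
import statistics


def box_plot_summary(data):
    """Quartiles, whisker ends, outliers and extreme outliers (illustrative helper)."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4)  # default "exclusive" method
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed 1.5 * IQR rule
    outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # assumed 3 * IQR rule
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    outliers = [
        x for x in xs
        if (outer_lo <= x < inner_lo) or (inner_hi < x <= outer_hi)
    ]
    extreme = [x for x in xs if x < outer_lo or x > outer_hi]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
        "extreme_outliers": extreme,
    }


# Hypothetical sample with one outlier and one extreme outlier.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 60]
print(box_plot_summary(sample))
```

Different software uses slightly different quartile rules, so the box edges may not match another package exactly.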
6-5 Time Sequence Plots
• A time series or time sequence is a data set in which the
observations are recorded in the order in which they occur.
• A time series plot is a graph in which the vertical axis
denotes the observed value of the variable (say x) and the
horizontal axis denotes the time (which could be minutes, days,
years, etc.).
• When measurements are plotted as a time series, we
often see
•trends,
•cycles, or
•other broad features of the data
Figure 6-16
Company sales by year (a) and by quarter (b).
Figure 6-17 gosh! – a stem and leaf
diagram combined with a time series plot
A digidot plot of the compressive strength data in Table 6-2.
Figure 6-18
A digidot plot of chemical process concentration readings, observed
hourly.
6-6 Probability Plots
• Probability plotting is a graphical method for determining
whether sample data conform to a hypothesized distribution
based on a subjective visual examination of the data.
• Probability plotting typically uses special graph paper, known
as probability paper, that has been designed for the
hypothesized distribution. Probability paper is widely available
for the normal, lognormal, Weibull, and various chi-square and
gamma distributions.
Probability (Q-Q)* Plots
•Forget ‘normal probability paper’
•Plot the z score versus the ranked observations, x(j)
•Subjective, visual technique usually applied to test
normality. Can also be adapted to other distributions.
•Method (for normal distribution):
•Rank the observations x(1), x(2), …, x(n) from smallest
to largest
•Compute the value (j - 1/2)/n for each x(j)
•Plot $z_j = F^{-1}\left((j - 1/2)/n\right)$ versus x(j), where F is the standard normal cumulative distribution function
Parentheses
usually indicate
ordering of data.
Computing zj, where $z_j = F^{-1}\left((j - 1/2)/n\right)$
n = 29 values; the xj values are ordered least to greatest.

  j     xj     (j-1/2)/n     zj
  1    4.07      0.017     -2.11
  2    4.88      0.052     -1.63
  3    5.10      0.086     -1.36
  4    5.26      0.121     -1.17
  5    5.27      0.155     -1.01
  …
 25    5.65      0.845      1.01
 26    5.75      0.879      1.17
 27    5.79      0.914      1.36
 28    5.85      0.948      1.63
 29    5.86      0.983      2.11
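The zj computation for the ten tabulated observations can be sketched with Python's standard library, where `NormalDist().inv_cdf` plays the role of Excel's NORMSINV. The function name `normal_scores` is made up for illustration; the xj values are the ten listed for n = 29.

```python
from statistics import NormalDist


def normal_scores(n):
    """z_j = F^{-1}((j - 1/2)/n) for j = 1..n, with F the standard normal CDF."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]


n = 29
z = normal_scores(n)

# The ten ordered observations listed for n = 29 (rank j -> x_(j)).
listed = {
    1: 4.07, 2: 4.88, 3: 5.10, 4: 5.26, 5: 5.27,
    25: 5.65, 26: 5.75, 27: 5.79, 28: 5.85, 29: 5.86,
}

print(" j   x(j)  (j-1/2)/n    zj")
for j, x in listed.items():
    p = (j - 0.5) / n
    print(f"{j:>2}  {x:5.2f}    {p:.3f}   {z[j - 1]:+.2f}")
```

Plotting `z[j - 1]` against the ordered observations gives the normal probability plot; points close to a straight line are consistent with normality.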
Example in EXCEL – Table 6-6, p. 214
Cavendish Earth Density Data
[Plot: zj values (vertical axis, -2.50 to 2.50) versus xj values (horizontal axis, 4.0 to 6.0). zj is the function NORMSINV.]
Example in EXCEL – Table 6-6, cont’d
Cavendish Earth Density Data (censored)
[Plot: zj values (vertical axis, -2.50 to 2.00) versus xj values (horizontal axis, 4.0 to 6.0). zj is the function NORMSINV.]
Example 6-7
Example 6-7 (continued)
Figure 6-19
Normal probability
plot for battery life.
Figure 6-20
Normal probability plot
obtained from
standardized normal
scores.
Figure 6-21
Normal probability plots indicating a nonnormal distribution.
(a) Light-tailed distribution. (b) Heavy-tailed distribution. (c) A
distribution with positive (or right) skew.
The Beginning of a
Comprehensive
Example
Descriptive Statistics in Action
• see real numbers, real data
• watch as they are manipulated in perverse ways
• be thrilled as they are sorted
• and be amazed as they are compressed into a single number

The Raw Data

As part of a life span study of
a particular type of lithium
polymer rechargeable battery,
120 batteries were operated
and their life span in
operating hours determined.
1676.5
895.6
1682.0
1913.6
2881.9
2007.8
3313.4
2156.4
1954.7
2210.4
1630.3
1818.8
1779.5
984.2
1512.6
2046.1
1613.3
2066.1
2926.9
1995.7
2386.6
1663.8
2045.9
1985.2
1387.6
718.3
1088.7
1879.4
2056.6
1740.2
2791.8
2476.0
845.1
1581.8
2713.7
2238.5
1314.2
729.3
1898.7
1377.2
1347.6
2420.6
2450.0
2319.7
2560.1
884.1
596.2
1779.7
908.3
955.4
2383.4
1577.6
2365.4
1527.9
2749.2
2439.7
2016.2
1757.8
1022.7
2063.8
1840.2
943.5
2210.5
2856.3
745.0
2125.3
1759.9
1297.0
2210.1
543.4
891.5
1818.8
1803.7
1460.3
1753.3
2633.1
4300.8
1250.8
1005.2
667.1
916.0
1351.9
1823.0
1944.9
1641.3
1694.0
1378.0
849.4
1882.6
2323.8
807.0
2088.8
2940.7
2004.6
1714.3
2039.1
1760.5
577.8
1945.6
1299.9
1592.0
1395.4
2401.8
2968.7
1952.3
2430.5
999.1
1608.4
983.8
1831.1
1307.4
2139.0
1552.6
1808.1
2398.0
2398.8
2824.3
715.2
2277.3
1941.2
Data generated from a Weibull distribution with shape β = 2.8 and scale δ = 2000
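Data of this kind can be simulated with the standard library. The sketch below assumes a shape of 2.8 and a scale of 2000, as in the slide's caption and in line with the Minitab Weibull estimates later in the example. The seed is arbitrary, so the values will not reproduce the 120 numbers listed here.

```python
import random

SHAPE = 2.8     # Weibull shape parameter
SCALE = 2000.0  # Weibull scale parameter, in operating hours

rng = random.Random(6)  # arbitrary seed, chosen only for repeatability

# random.weibullvariate(alpha, beta) takes the scale first, then the shape.
lifetimes = [round(rng.weibullvariate(SCALE, SHAPE), 1) for _ in range(120)]

print(len(lifetimes), min(lifetimes), max(lifetimes))
```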
Descriptive Statistics: Minitab

Variable         N     Mean   Median   TrMean    StDev   SE Mean
Battery Life   120   1789.4   1813.4   1773.9    661.5      60.4

Variable       Minimum   Maximum       Q1       Q3
Battery Life     543.4    4300.8   1348.7   2210.3

TrMean = trimmed mean
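The same summary can be sketched with Python's `statistics` module. Two choices here are assumptions rather than documented Minitab behavior: the trimmed mean drops 5% of the observations from each end (count rounded to an integer), and the quartiles use the default "exclusive" method of `statistics.quantiles`, which interpolates at position p(n + 1). The function name `describe` is made up for illustration.

```python
import math
import statistics


def describe(data, trim=0.05):
    """Summary statistics similar to the Minitab table (illustrative helper)."""
    xs = sorted(data)
    n = len(xs)
    k = round(trim * n)  # observations dropped from each end (assumed rule)
    trimmed = xs[k:n - k] if k else xs
    q1, _, q3 = statistics.quantiles(xs, n=4)  # "exclusive": position p * (n + 1)
    stdev = statistics.stdev(xs)               # divides by n - 1
    return {
        "N": n,
        "Mean": statistics.fmean(xs),
        "Median": statistics.median(xs),
        "TrMean": statistics.fmean(trimmed),
        "StDev": stdev,
        "SE Mean": stdev / math.sqrt(n),
        "Minimum": xs[0],
        "Maximum": xs[-1],
        "Q1": q1,
        "Q3": q3,
    }


# First ten battery lifetimes from the raw data, used only to show the call.
first_ten = [1676.5, 895.6, 1682.0, 1913.6, 2881.9,
             2007.8, 3313.4, 2156.4, 1954.7, 2210.4]
for name, value in describe(first_ten).items():
    print(f"{name:>8}: {value}")
```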
More Minitab
Histogram of Battery Life
[Figure: histogram. Horizontal axis: Battery Life (0 to 4500). Vertical axis: Frequency (0 to 20).]
More Minitab
Histogram of Battery Life, with Normal Curve
[Figure: histogram with fitted normal curve. Horizontal axis: Battery Life (0 to 4500). Vertical axis: Frequency (0 to 20).]
Stem and Leaf Plot
Leaf Unit = 100
  21   0 555677778888889999999
  36   1 000222333333334
 (40)  1 5555556666666677777777888888888899999999
  44   2 0000000000111222223333333444444
  13   2 56777888999
   2   3 3
   1   3
   1   4 3
Dotplot for Battery Life
[Figure: dot plot. Horizontal axis: Battery Life (1000 to 4000).]
More Minitab
Boxplot of Battery Life
[Figure: box plot. Horizontal axis: Battery Life (0 to 4000).]
More Minitab
Descriptive Statistics
Variable: Battery Life

Anderson-Darling Normality Test
  A-Squared:     0.566
  P-Value:       0.139

  Mean           1789.45
  StDev          661.53
  Variance       437627
  Skewness       0.359617
  Kurtosis       0.736628
  N              120

  Minimum        543.40
  1st Quartile   1348.68
  Median         1813.45
  3rd Quartile   2210.32
  Maximum        4300.80

95% Confidence Interval for Mu:      1669.87 to 1909.03
95% Confidence Interval for Sigma:   587.10 to 757.74
95% Confidence Interval for Median:  1691.57 to 1945.04

[Figure: histogram of Battery Life (axis 700 to 4300) with interval plots of the 95% confidence intervals for Mu and for the Median (axis 1650 to 1950).]
Time Series Plot
Based upon the order in which the data were generated
[Figure: time series plot. Horizontal axis: Index (20 to 120). Vertical axis: Battery Life (0 to 4000).]
Time Series Plot
Sorted by failure time
[Figure: time series plot of the sorted data. Horizontal axis: Index (20 to 120). Vertical axis: sorted Battery Life (0 to 4000).]
Normal Probability Plot for Battery Life
ML Estimates: Mean 1789.45, StDev 658.771
[Figure: normal probability plot. Horizontal axis: Data (0 to 4000). Vertical axis: Percent (1 to 99).]
Weibull Probability Plot for Battery Life
ML Estimates: Shape 2.92250, Scale 2005.35
[Figure: Weibull probability plot. Horizontal axis: Data (100 to 1000 marked). Vertical axis: Percent (1 to 99).]
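A Weibull probability plot rests on the linearization ln(-ln(1 - F(x))) = β ln(x) - β ln(δ), so Weibull data fall near a straight line with slope β. The sketch below fits that line by ordinary least squares using the plotting positions (j - 1/2)/n. This is not how Minitab obtains the ML estimates shown on the plot; it is a graphical approximation, and the function name `weibull_plot_fit` is hypothetical.

```python
import math


def weibull_plot_fit(data):
    """Estimate (shape, scale) from the straight line on a Weibull probability plot."""
    xs = sorted(data)
    n = len(xs)
    u = [math.log(x) for x in xs]                # horizontal axis: ln(x_(j))
    v = [math.log(-math.log(1 - (j - 0.5) / n))  # vertical axis: ln(-ln(1 - p_j))
         for j in range(1, n + 1)]
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    slope = (sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
             / sum((a - u_bar) ** 2 for a in u))
    intercept = v_bar - slope * u_bar
    shape = slope                           # slope of the line is beta
    scale = math.exp(-intercept / slope)    # intercept is -beta * ln(delta)
    return shape, scale


# First ten battery lifetimes from the raw data, used only to show the call.
first_ten = [1676.5, 895.6, 1682.0, 1913.6, 2881.9,
             2007.8, 3313.4, 2156.4, 1954.7, 2210.4]
print(weibull_plot_fit(first_ten))
```

Least-squares estimates from the plot generally differ somewhat from maximum likelihood estimates, especially in small samples.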
Exponential Probability Plot for Battery Life
ML Estimates: Mean 1789.45
[Figure: exponential probability plot. Horizontal axis: Data (0 to 10000). Vertical axis: Percent (10 to 99).]
Computer Support
This is easy if you use the
computer.
hang on, we are going to Excel…
A Recap …
•Population – the totality of observations with which we are
concerned. Issue: conceptual vs. actual.
•Sample – subset of observations selected from a population.
•Statistic – any function of the observations in a sample.
•Sample range – If the n observations in a sample are denoted
by x1, x2, …,xn, then the sample range is r = max(xi) – min(xi).
•Sample mean and variance.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Note that these are
functions of the
observations in a
sample and are,
therefore, statistics.
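The sample range, mean, and variance can be written as plain functions of the observations, which is exactly what makes them statistics. The sketch below computes the variance both from the definition and from the computational shortcut; the function names and the eight data values are hypothetical.

```python
def sample_range(xs):
    """r = max(x_i) - min(x_i)."""
    return max(xs) - min(xs)


def sample_mean(xs):
    """x-bar = (sum of x_i) / n."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """s^2 from the definition: sum of squared deviations over n - 1."""
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


def sample_variance_shortcut(xs):
    """s^2 from the shortcut: (sum of x_i^2 - n * x-bar^2) / (n - 1)."""
    n = len(xs)
    x_bar = sample_mean(xs)
    return (sum(x * x for x in xs) - n * x_bar ** 2) / (n - 1)


# Hypothetical observations.
obs = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_range(obs), sample_mean(obs))
print(sample_variance(obs), sample_variance_shortcut(obs))
```

The two variance forms agree algebraically, but the shortcut can lose precision in floating point when the mean is large relative to the spread, so the definition is the safer choice in code.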
More Recapping …
Note terminology –
‘population parameter’ vs.
‘sample statistic’
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$$
Note difference in denominators
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
Sample variance uses an estimate of the mean (xbar) in its
calculation. If divided by n, the sample variance would be a
biased estimate – biased low.
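The bias can be checked exactly on a tiny case. The sketch below takes a hypothetical population {1, 2, 3}, lists every equally likely ordered sample of size 2 drawn with replacement (so the draws are independent), and averages the two candidate variance estimates over all of them.

```python
from itertools import product

population = [1, 2, 3]  # hypothetical population
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance, divisor N


def variance(sample, divisor):
    """Sum of squared deviations from the sample mean, over the given divisor."""
    x_bar = sum(sample) / len(sample)
    return sum((x - x_bar) ** 2 for x in sample) / divisor


n = 2
samples = list(product(population, repeat=n))  # all 9 equally likely samples

avg_divide_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_divide_by_n = sum(variance(s, n) for s in samples) / len(samples)

print("population variance:      ", sigma2)
print("average with n - 1:       ", avg_divide_by_n_minus_1)
print("average with n (too low): ", avg_divide_by_n)
```

Averaged over all samples, dividing by n - 1 recovers the population variance, while dividing by n gives (n - 1)/n times it.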
Sampling Process
• X is a random variable that represents one selection from a population.
• Each observation in the sample is obtained under identical conditions.
    • The population does not change during sampling.
    • The probability distribution of values does not change during sampling.
• f(x1, x2, …, xn) = f(x1)f(x2)…f(xn) if the observations in the sample are independent.
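The factorization of the joint density for an independent sample can be illustrated numerically. The sketch below assumes, purely for illustration, that each observation follows a standard normal distribution; the function name `joint_density` is hypothetical.

```python
import math
from statistics import NormalDist


def joint_density(sample, dist):
    """f(x1, ..., xn) = f(x1) f(x2) ... f(xn) for independent observations."""
    return math.prod(dist.pdf(x) for x in sample)


model = NormalDist(mu=0.0, sigma=1.0)  # assumed population model
print(joint_density([0.0, 0.0], model))
print(joint_density([-1.0, 0.5, 2.0], model))
```

The product form holds only under independence; for dependent observations the joint density has to be specified directly.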
Notation
• X1, X2, …, Xn are the random variables.
• x1, x2, …, xn are the values of the random variables.
A Final Recap…
A probability distribution is often a model for a population.
This is often the case when the population is conceptual or infinite.
The histogram should
resemble the distribution of
population values. The
bigger the sample, the
stronger the resemblance.
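The resemblance between histogram and population model can be sketched by simulation. The code below assumes a Weibull model with shape 2.8 and scale 2000 (the values used for the battery example) and bins of width 500 from 0 to 4500. It compares the relative frequency in each bin with the model probability for that bin and reports the largest gap. The seed and all function names are illustrative choices.

```python
import math
import random

SHAPE, SCALE = 2.8, 2000.0  # assumed Weibull model
WIDTH = 500
EDGES = [WIDTH * k for k in range(10)]  # 0, 500, ..., 4500


def weibull_cdf(x):
    """F(x) = 1 - exp(-(x / scale)^shape) for x >= 0."""
    return 1 - math.exp(-((x / SCALE) ** SHAPE))


def model_probabilities():
    """Probability the model assigns to each bin."""
    return [weibull_cdf(b) - weibull_cdf(a) for a, b in zip(EDGES, EDGES[1:])]


def relative_frequencies(data):
    """Fraction of the observations falling in each bin."""
    counts = [0] * (len(EDGES) - 1)
    for x in data:
        k = int(x // WIDTH)
        if 0 <= k < len(counts):  # the rare value beyond the last edge is ignored
            counts[k] += 1
    return [c / len(data) for c in counts]


def max_gap(n, seed=0):
    """Largest |relative frequency - model probability| over the bins."""
    rng = random.Random(seed)
    data = [rng.weibullvariate(SCALE, SHAPE) for _ in range(n)]
    return max(abs(f - p)
               for f, p in zip(relative_frequencies(data), model_probabilities()))


for n in (120, 2000, 20000):
    print(n, round(max_gap(n), 4))
```

The gap tends to shrink roughly like one over the square root of the sample size, although any single simulated sample can deviate from that trend.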
Our Work Here Today
is Done
Next Week:
The Glorious Midterm
Prob/Stat students
Discussing stem and leaf plots