Uploaded by Mae Manaog

[PDF Lecture] Introduction to Statistics and Data Analysis

INTRODUCTION TO STATISTICS AND DATA ANALYSIS
ENGINEERING DATA ANALYSIS
THE CHALLENGE
Advances in science and engineering occur in large part through the collection and
analysis of data. Proper analysis of data can be challenging, because scientific
data are subject to random variation.
How can one draw conclusions from the results of an experiment when those results could
have come out differently?
The methods of statistics allow scientists and engineers to design valid experiments and to
draw reliable conclusions from the data they produce.
THE ENGINEERING METHOD AND STATISTICAL THINKING
Many of the engineering sciences are employed in the
engineering problem-solving method:
• mechanical sciences, such as statics and dynamics
• fluid science
• thermal sciences, such as thermodynamics and heat transfer
• electrical sciences
• materials science
• chemical sciences
Engineering data can be collected by
• Retrospective study
• Observational study
• Designed experiment
THE BASIC IDEA
The basic idea behind all statistical methods of data analysis is to make inferences
about a population by studying a relatively small sample from it.
For example, consider a machine that makes steel balls for ball bearings used in clutch
systems. The specification for the diameter of the balls is 0.65 ± 0.03 cm. During the last
hour, the machine has made 2000 balls. The quality engineer (QE) wants to know how many
of these balls meet the specification. He does not have the time to measure all 2000 balls,
so he draws a random sample of 80 balls, 72 of which (90%) meet the specification. Can he
be sure that 90% of the whole population meets the specification?
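The sample proportion in this example is simple arithmetic, sketched below. The function name `proportion_within_spec` and the five diameters in `demo_sample` are hypothetical and made up for illustration; only the specification limits and the counts 72 and 80 come from the example.

```python
def proportion_within_spec(diameters, low, high):
    """Return the fraction of measured diameters that lie in [low, high]."""
    if not diameters:
        raise ValueError("sample must not be empty")
    n_ok = sum(1 for d in diameters if low <= d <= high)
    return n_ok / len(diameters)


# Specification from the example: 0.65 +/- 0.03 cm, i.e. 0.62 cm to 0.68 cm.
LOW_CM, HIGH_CM = 0.62, 0.68

# Hypothetical measurements in cm (not data from the lecture).
demo_sample = [0.64, 0.66, 0.70, 0.65, 0.61]
demo_proportion = proportion_within_spec(demo_sample, LOW_CM, HIGH_CM)

# Counts quoted in the example: 72 of the 80 sampled balls meet the specification.
quoted_proportion = 72 / 80
```

The value computed from a sample is only an estimate of the population proportion; another sample of 80 balls would probably give a somewhat different value.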
TWO FIELDS OF STATISTICS
• INFERENTIAL STATISTICS is the process of using data analysis to draw conclusions
(“inferences”) about a population from sample data.
• DESCRIPTIVE STATISTICS is used to describe the basic features of the data in a study,
in the form of charts, graphs, plots, etc.
COLLECTING ENGINEERING DATA
[Figure: a sample drawn from a population]
DEFINITION
• A population is the entire collection of objects or outcomes about which
information is sought.
• A sample is a subset of a population, containing the objects or outcomes that are
actually observed.
TANGIBLE VS. CONCEPTUAL POPULATIONS
DEFINITION
• A tangible population consists of actual physical objects. Tangible populations are
countable and always finite.
• A conceptual population consists of all the values that might possibly have been
observed. A simple random sample from a conceptual population may consist of values
obtained from a process under identical experimental conditions.
Example: Each of the following processes involves sampling from a population. Define the
population, and state whether it is tangible or conceptual.
• A shipment of bolts is received from a vendor. To check whether the shipment is acceptable
with regard to shear strength, an engineer reaches into the container and selects 10 bolts,
one by one to test.
• The resistance of a certain resistor is measured 5 times with the same ohmmeter.
SAMPLING
DEFINITION
• A simple random sample of size n is a sample chosen by a method in which each
collection of n population items is equally likely to comprise the sample, just as in a
lottery.
Think of a lottery with 10,000 tickets from which 5 winners will be chosen. What is the
fairest way to choose the winners?
EXAMPLE:
A utility company wants to conduct a survey to measure the satisfaction level of its
customers in a certain town. There are 10,000 customers in the town, and utility
employees want to draw a sample of size 200 to interview personally. They obtain a list
of all 10,000 customers, and number them from 1 to 10,000. They use a computer
random number generator to generate 200 random integers between 1 and 10,000 and
then contact the customers who correspond to those numbers. Is this a simple random
sample?
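The procedure in this example can be sketched with Python's standard `random` module. The function name `simple_random_sample` and the seed value are assumptions made for illustration. Note that `random.sample` selects without replacement, so no customer number can appear twice; generating 200 integers independently could produce repeats, which would then have to be redrawn.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct item numbers from 1..population_size.

    Every subset of the requested size is equally likely to be chosen,
    which is the defining property of a simple random sample.
    """
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), sample_size)


# 200 of the 10,000 numbered customers; the seed only makes the run repeatable.
customer_ids = simple_random_sample(10_000, 200, seed=1)
```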
EXAMPLE:
A quality engineer wants to inspect electronic microcircuits in order to obtain
information on the proportion that are defective. She decides to draw a sample of 100
circuits from a day’s production. Each hour for 5 hours, she takes the 20 most recently
produced circuits and tests them. Is this a simple random sample?
EXAMPLE:
A construction engineer has just received a shipment of 1000 concrete blocks, each
weighing approximately 25 kilograms. The blocks have been delivered in a large pile. The
engineer wishes to investigate the compressive strength of the blocks by measuring the
strengths in a sample of blocks. What is the most appropriate method of selecting
a random sample?
DEFINITION
• A sample of convenience is a sample that is not drawn by a well-defined random
method.
Suppose, for example, that a quality inspector draws a random sample of 40 bolts from a
large shipment, measures the length of each, and finds that 32 of them (80%) meet a length
specification. A second inspector draws another sample and, by chance, gets a few more good
bolts, about 90% of her sample. The proportion of good bolts in the population is likely to
be close to 80% or 90%, but it is not likely to be exactly equal to either value.
DEFINITION
• Sampling variation is the phenomenon that two or more different samples from the same
population will differ from each other.
DEFINITION
• With sampling with replacement, what one gets in one sample does not affect what
one gets in a different sample. In this case, we say that the samples are
independent.
• With sampling without replacement, what one gets in one sample does affect what
one gets in a different sample. In this case, we say that the samples are dependent.
An urn contains five balls numbered 1 through 5. I pick two balls, write down their
numbers, and place them back in the urn. Then I pick another two balls and write down
their numbers. Are the two samples dependent or independent?
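The two situations described in the definition can be sketched for the urn as follows. Both function names are hypothetical. In the first function the balls are returned before the second pair is drawn, as in the question; in the second they are not, so the second pair can only come from the three balls left.

```python
import random

URN = [1, 2, 3, 4, 5]


def two_samples_with_return(seed=None):
    """Draw two balls, put them back, then draw two balls again."""
    rng = random.Random(seed)
    first = rng.sample(URN, 2)
    # The urn is complete again, so the first pair does not restrict the second.
    second = rng.sample(URN, 2)
    return first, second


def two_samples_without_return(seed=None):
    """Draw two balls, keep them out, then draw two of the remaining three."""
    rng = random.Random(seed)
    first = rng.sample(URN, 2)
    remaining = [ball for ball in URN if ball not in first]
    second = rng.sample(remaining, 2)
    return first, second
```

In the second function, knowing the first pair rules out those two numbers for the second pair, which is what makes the samples dependent.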
OTHER SAMPLING METHODS
• Weighted sampling is when some items are given a greater chance of being
selected than others (e.g., a lottery in which some people have more tickets than
others).
• Stratified random sampling is when the population is divided into subpopulations
known as strata, and a simple random sample is drawn from each stratum.
• Cluster sampling is when items are drawn from the population in groups or clusters.
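Of the methods above, stratified random sampling is the easiest to sketch in a few lines: a simple random sample is drawn separately within each stratum. The function name `stratified_sample`, the stratum names, and the item numbers below are all hypothetical.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of sizes[name] items from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(items, sizes[name]) for name, items in strata.items()}


# Hypothetical population split into two strata (for example, two production shifts).
demo_strata = {
    "shift_A": list(range(0, 50)),   # item numbers 0..49
    "shift_B": list(range(50, 80)),  # item numbers 50..79
}
demo_draw = stratified_sample(demo_strata, {"shift_A": 5, "shift_B": 3}, seed=0)
```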
TYPES OF DATA
DEFINITION
• When a numerical quantity designating how much or how many is assigned to each
item in a sample, the resulting set of values is called numerical or quantitative.
• In some cases, if sample items are placed into categories, and category names are
assigned to the sample items, the data are categorical or qualitative.
Example:
In a loading test of column-to-beam welded connections, data may be collected
both on the torque applied at failure and on the location of the failure (weld or beam).
Quantitative variable: Torque
Qualitative variable: Location (weld or beam)
SUMMARY STATISTICS
SAMPLE MEAN
The sample mean, also known as the “arithmetic mean” or the “average”, is the sum
of the numbers in a sample, divided by how many there are.
DEFINITION
Let $X_1, \ldots, X_n$ be a sample. The sample mean is:
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
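As a small illustration of the definition, the sketch below computes the mean of a made-up sample. The function name `sample_mean` and the values in `demo_values` are assumptions, not data from the lecture.

```python
def sample_mean(xs):
    """Sum of the sample values divided by the number of values."""
    if not xs:
        raise ValueError("sample must not be empty")
    return sum(xs) / len(xs)


# Hypothetical sample: (1 + 3 + 5 + 7 + 9) / 5 = 25 / 5
demo_values = [1, 3, 5, 7, 9]
demo_mean = sample_mean(demo_values)
```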
SAMPLE VARIANCE AND STANDARD DEVIATION
The sample standard deviation is a quantity that measures the degree of spread in a
sample. The square of the sample standard deviation is the sample variance.
DEFINITION
Let $X_1, \ldots, X_n$ be a sample. The sample variance is the quantity:
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
An equivalent formula can be used:
$$s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)$$
DEFINITION
Let $X_1, \ldots, X_n$ be a sample. The sample standard deviation is the quantity:
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$$
An equivalent formula can be used:
$$s = \sqrt{\frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right)}$$
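The sketch below computes the sample variance with both the defining formula and the equivalent formula, and takes the square root to get the standard deviation. The function names and the five-value sample are hypothetical and serve only to check that the two formulas agree.

```python
import math


def sample_variance(xs):
    """Defining formula: sum of squared deviations from the mean, over n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Equivalent formula: (sum of squares - n * mean^2) over n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


def sample_std(xs):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical sample with mean 5: squared deviations 16 + 4 + 0 + 4 + 16 = 40,
# so the variance is 40 / 4; sum of squares 165 gives (165 - 5 * 25) / 4 as well.
demo_values = [1, 3, 5, 7, 9]
demo_variance = sample_variance(demo_values)
demo_std = sample_std(demo_values)
```

The two formulas are algebraically identical, but in floating-point arithmetic the second can lose accuracy when the values are large relative to their spread, so the defining formula is usually the safer choice in code.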
OUTLIERS
Sometimes, a sample may contain a few points that are much larger or smaller than the
rest. Such points are called outliers. Outliers may result from data entry errors. They
need to be scrutinized, and those that turn out to be errors should be corrected or deleted.
SAMPLE MEDIAN
The median is a measure of center.
DEFINITION
If n numbers are ordered from smallest to largest:
• If n is odd, the sample median is the number in position (n + 1)/2.
• If n is even, the sample median is the average of the numbers in positions n/2 and
n/2 + 1.
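The two cases of the definition translate directly into code. In this sketch the function name `sample_median` is an assumption, and the positions in the definition (counted from 1) are converted to Python list indices (counted from 0).

```python
def sample_median(xs):
    """Median by the odd/even position rule applied to the sorted sample."""
    if not xs:
        raise ValueError("sample must not be empty")
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Position (n + 1) / 2 counted from 1 is index n // 2 counted from 0.
        return ordered[mid]
    # Positions n / 2 and n / 2 + 1 counted from 1 are indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For the made-up samples `[1, 3, 5, 7, 9]` and `[1, 3, 5, 7, 900]` the function returns 5 in both cases, which illustrates that a single extreme value does not move the median.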
QUARTILES
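Just as the median divides the sample in half, the quartiles divide it as nearly as
possible into quarters. A sample has 3 quartiles.
Let n represent the sample size. In the ordered sample, the quartiles are located at the
following positions:
First quartile: 0.25(n + 1)
Second quartile: 0.50(n + 1)
Third quartile: 0.75(n + 1)
If the position is an integer, the sample value in that position is the quartile;
otherwise, take the average of the two sample values on either side of it.
Note that the second quartile is the same as the median.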
Example:
In the article “Evaluation of Low-Temperature Properties of HMA Mixtures” (P. Sebaaly,
A. Lake, and J. Epps, Journal of Transportation Engineering, 2002:578–583), the following
values of fracture stress (in MPa) were measured for a sample of 22 mixtures of hot-mixed
asphalt (HMA).
30 75 79 80 80 105 126 138 149 179 191
223 232 236 240 242 245 247 254 274 384 470
Find the first and third quartiles.
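One way to work this example is to apply the position formulas with n = 22 and, because neither position is an integer, to average the two neighboring values in the ordered sample. The symbols $Q_1$ and $Q_3$ are introduced here only as shorthand for the first and third quartiles.

```latex
\begin{align*}
\text{First quartile position: } & 0.25(22 + 1) = 5.75
  && \text{(between the 5th and 6th values, 80 and 105)} \\
Q_1 &= \frac{80 + 105}{2} = 92.5 \\
\text{Third quartile position: } & 0.75(22 + 1) = 17.25
  && \text{(between the 17th and 18th values, 245 and 247)} \\
Q_3 &= \frac{245 + 247}{2} = 246
\end{align*}
```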
PERCENTILES
The pth percentile of a sample, for a number p between 0 and 100, divides the sample
so that as nearly as possible p% of the sample values are less than the pth percentile
and (100 − p)% are greater.
Let n represent the sample size. In the ordered sample, the pth percentile is located at
position
$$\frac{p}{100}(n + 1)$$
If this position is an integer, the sample value in that position is the pth percentile;
otherwise, take the average of the two sample values on either side of it.
Note that the 25th percentile is the 1st quartile, the median is the 50th percentile and
the 2nd quartile, and the 75th percentile is the 3rd quartile.
Example:
In the article “Evaluation of Low-Temperature Properties of HMA Mixtures” (P. Sebaaly,
A. Lake, and J. Epps, Journal of Transportation Engineering, 2002:578–583), the following
values of fracture stress (in MPa) were measured for a sample of 22 mixtures of hot-mixed
asphalt (HMA).
30 75 79 80 80 105 126 138 149 179 191
223 232 236 240 242 245 247 254 274 384 470
Find the 65th percentile.
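The position rule for percentiles can be sketched as a short function. The name `sample_percentile` is an assumption, and the sketch accepts only whole-number values of p so that integer arithmetic decides exactly whether the position is an integer. Other percentile conventions exist, and library routines may interpolate differently, so results can differ slightly from this rule.

```python
def sample_percentile(xs, p):
    """pth percentile using the position (p / 100)(n + 1) in the sorted sample.

    p must be a whole number. If the position is an integer, the value at that
    position is returned; otherwise the two neighboring values are averaged.
    """
    ordered = sorted(xs)
    n = len(ordered)
    # whole = integer part of the position, remainder = 0 only for an integer position
    whole, remainder = divmod(p * (n + 1), 100)
    if whole < 1 or whole > n or (remainder and whole == n):
        raise ValueError("position falls outside the sample")
    if remainder == 0:
        return ordered[whole - 1]
    return (ordered[whole - 1] + ordered[whole]) / 2


# Fracture stress values (MPa) from the example.
fracture_stress = [30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 191,
                   223, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470]

# Position 0.65 * 23 = 14.95, so the 14th and 15th values (236 and 240) are averaged.
p65 = sample_percentile(fracture_stress, 65)
```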
GRAPHICAL SUMMARIES
STEM-AND-LEAF PLOT
Example:
The table below shows data from a study of the bioactivity of a certain antifungal drug.
The drug was applied to the skin of 48 subjects. After 3 hours, the amounts of drug
remaining in the skin were measured in units of ng/cm². The list has been sorted in
numerical order (reading down each column).
 3  15  22  27  40
 4  16  22  33  41
 4  16  22  34  41
 7  17  23  34  51
 7  17  24  35  53
 8  18  25  36  55
 9  20  26  36  55
 9  20  26  37  74
12  21  26  38
12  21  26  40
Stem-and-leaf plot:
Stem  Leaf
0     34477899
1     22566778
2     001122234566667
3     34456678
4     0011
5     1355
6
7     4
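A plot like this one can be built by splitting each value into a stem (the tens digit) and a leaf (the ones digit). The sketch below assumes non-negative whole-number data; the function name `stem_and_leaf` is hypothetical, and the data are the 48 values from the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return one text line per stem: the stem, a space, then the sorted leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)  # tens digit is the stem, ones digit the leaf
    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} {digits}".rstrip())
    return lines


# Amounts of drug remaining (ng/cm^2) for the 48 subjects in the example.
drug_amounts = [3, 4, 4, 7, 7, 8, 9, 9, 12, 12, 15, 16, 16, 17, 17, 18,
                20, 20, 21, 21, 22, 22, 22, 23, 24, 25, 26, 26, 26, 26, 27,
                33, 34, 34, 35, 36, 36, 37, 38, 40, 40, 41, 41, 51, 53, 55, 55, 74]

plot_lines = stem_and_leaf(drug_amounts)
```

Printing `"\n".join(plot_lines)` gives the rows of the plot, including the empty stem 6.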
DOTPLOT
A dotplot is a graph that gives a rough impression of the shape of a sample. It is useful
when the sample size is not too large and when the sample contains some repeated values.
HISTOGRAM
A histogram is a graphic that gives an idea of the “shape” of a sample, indicating
regions where sample points are concentrated and regions where they are sparse.
Example:
The table below shows the PM emissions (in g/gal) of 62 vehicles driven at high altitude.
 7.50   6.28   6.07   5.23   5.54   3.46   2.44   3.01  13.63  13.02
23.38   9.24   3.22   2.06   4.04  17.11  12.26  19.91   8.50   7.81
 7.18   6.95  18.64   7.10   6.04   5.66   8.86   4.40   3.57   4.35
 3.84   2.37   3.81   5.32   5.84   2.85   4.68   1.85   9.14   8.67
 9.52   2.68  10.14   9.20   7.31   2.09   6.32   6.53   6.32   2.01
 5.91   5.60   5.61   1.50   6.46   5.29   5.64   2.07   1.11   3.32
 1.83   7.56
Example:
For the PM emissions of the 62 vehicles driven at high altitude, construct a frequency
table.
The data are counted into several class intervals. There is no hard and fast rule for
deciding how many class intervals to use.
Class interval (g/gal)   Frequency   Relative frequency
 1 ≤ x <  3              12          0.1935
 3 ≤ x <  5              11          0.1774
 5 ≤ x <  7              18          0.2903
 7 ≤ x <  9               9          0.1452
 9 ≤ x < 11               5          0.0806
11 ≤ x < 13               1          0.0161
13 ≤ x < 15               2          0.0323
15 ≤ x < 17               0          0.0000
17 ≤ x < 19               2          0.0323
19 ≤ x < 21               1          0.0161
21 ≤ x < 23               0          0.0000
23 ≤ x < 25               1          0.0161
To construct a histogram: (1) determine the number of classes to use and construct intervals of equal
width; (2) compute the frequency and relative frequency for each class; and (3) draw a rectangle for each
class, with height equal to either the frequency or the relative frequency of that class.
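Steps (1) and (2) can be sketched in code using the PM emissions data and the class intervals from the example (12 classes of width 2 g/gal starting at 1). The function name `frequency_table` and its parameters are assumptions made for illustration; step (3), the drawing itself, is left to a plotting tool.

```python
def frequency_table(values, start, width, n_classes):
    """Count values into equal-width classes [start + k*width, start + (k+1)*width).

    Returns one tuple per class: (lower limit, upper limit, frequency,
    relative frequency).
    """
    counts = [0] * n_classes
    for x in values:
        k = int((x - start) // width)  # index of the class that contains x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} lies outside the class intervals")
        counts[k] += 1
    n = len(values)
    return [(start + k * width, start + (k + 1) * width, c, c / n)
            for k, c in enumerate(counts)]


# PM emissions (g/gal) of the 62 vehicles in the example.
pm_emissions = [
    7.50, 6.28, 6.07, 5.23, 5.54, 3.46, 2.44, 3.01, 13.63, 13.02,
    23.38, 9.24, 3.22, 2.06, 4.04, 17.11, 12.26, 19.91, 8.50, 7.81,
    7.18, 6.95, 18.64, 7.10, 6.04, 5.66, 8.86, 4.40, 3.57, 4.35,
    3.84, 2.37, 3.81, 5.32, 5.84, 2.85, 4.68, 1.85, 9.14, 8.67,
    9.52, 2.68, 10.14, 9.20, 7.31, 2.09, 6.32, 6.53, 6.32, 2.01,
    5.91, 5.60, 5.61, 1.50, 6.46, 5.29, 5.64, 2.07, 1.11, 3.32,
    1.83, 7.56,
]

pm_table = frequency_table(pm_emissions, start=1, width=2, n_classes=12)
pm_frequencies = [row[2] for row in pm_table]
```

The rectangle heights of the histogram are then either the frequencies in `pm_frequencies` or the relative frequencies in the fourth field of each row.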
SKEWNESS
Skewness refers to the asymmetry of a histogram; a symmetric histogram has its
right half a mirror image of its left half. A histogram that is skewed to the left, or
negatively skewed, has a long left-hand tail. Similarly, a histogram that is skewed to the
right, or positively skewed, has a long right-hand tail.
HISTOGRAM MODES
A histogram mode is a “peak”, or local maximum, in a histogram. A histogram is said to be
unimodal if it has only one peak or mode, and bimodal if it has two clearly distinct modes.
A bimodal histogram often indicates that the sample can be divided into two subsamples that
differ from each other in some scientifically important way.