Lesson 8 Introduction to Statistics

 Statistics is the branch of mathematics that examines ways to process and analyze data.
 Statistics is the branch of mathematics that deals with the collection, organization, and analysis of numerical data and with such problems as experiment design and decision making.
 A statistic is any quantity whose value can be calculated from sample data.
 Making inferences from data
 Applications under uncertainty
 Sampling
 Relation analysis
 Forecasting
 Decision making under uncertainty
 A population consists of all of the members of a group about which you want to draw a conclusion.
 A sample is the portion of the population selected for analysis.
 A parameter is a numerical measure that describes a characteristic of a population.
 A statistic is a numerical measure that describes a characteristic of a sample.
 Population: all the students at a university, all the registered voters in Svay Rieng…
 Sample: a subset selected from the population above, e.g. 10 students selected, or 500 registered voters who participated in a survey.
 The average grade of all the students this semester is a parameter.
 The average grade of the 10 selected students is a statistic: only the information from those 10 students is used in calculating it.
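The distinction can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the 36 final scores used later in this lesson are treated as if they were the whole population. The seed value and the variable names are arbitrary choices, not part of the lesson.

```python
import random
from statistics import mean

# Assumption for illustration: these 36 final scores are the entire population.
population = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
              68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
              53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 95]

# Parameter: computed from every member of the population.
parameter = mean(population)

# Statistic: computed from a sample only (10 students chosen at random).
rng = random.Random(8)  # arbitrary fixed seed so the run is repeatable
sample = rng.sample(population, 10)
statistic = mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

A different seed gives a different sample and usually a different statistic, while the parameter stays fixed.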
 Descriptive statistics focuses on collecting, summarizing, and presenting a set of data. These activities are also known as primary analyses.
 Inferential statistics uses sample data to draw conclusions about a population. These activities are also known as secondary analyses.
 Example: The final scores of the students are
84 49 61 40 83 67 45 66
70 69 80 58 68 60 67 72
73 70 57 63 70 78 52 67
53 67 75 61 70 81 76 79
75 76 58 95
Without any organization, it is difficult to get a sense of what a typical or representative score might be, whether the values are highly concentrated about a typical value or quite spread out, whether there are any gaps in the data, what percentage of the values…
score Stem-and-Leaf Plot

Frequency    Stem &  Leaf
    1.00        4 .  0
    2.00        4 .  59
    2.00        5 .  23
    3.00        5 .  788
    4.00        6 .  0113
    7.00        6 .  6777789
    6.00        7 .  000023
    6.00        7 .  556689
    4.00        8 .  0134
     .00        8 .
     .00        9 .
    1.00        9 .  5

Stem width:  10
Each leaf:   1 case(s)
 Having obtained a sample from a population, an investigator would frequently like to use sample information to draw some type of conclusion (make an inference of some sort) about the population.
 That is, the sample is a means to an end rather than an end in itself.
[Figure: Population and Sample]
 Variables are characteristics of items or individuals.
E.g. your gender, your major field of study, and the amount of money you have in your wallet are variables. The key aspect of a variable is the idea that items differ and people differ.
A variable is one of 2 types:
 Discrete: if its set of possible values either is finite or else can be listed in an infinite sequence (one in which there is a first number, a second number, and so on).
 Continuous: if its possible values consist of an entire interval on the number line.
Variables are also divided into 2 types:
 Quantitative variable: a variable that can be expressed as a number, such as people's income or boxers' weight.
 Qualitative variable: a variable that cannot be expressed as a number, such as gender or living standard.
 Primary data: original data collected from the source (experiments, surveys, etc.).
 Secondary data: data extracted from other reports or documents in which the data have already been collected.
 Descriptive statistics can be divided into two general subject areas: visual techniques and numerical summary measures for data sets.
 Visual techniques: frequency tables, histograms, pie charts, bar graphs, scatter diagrams, etc.
 Numerical summary measures: mean, variance, standard deviation, etc.
 Frequency: the number of times a value $x_i$ occurs, denoted by $f_i$.
 Total frequency: the sum of all frequencies, denoted by N or n.
Total frequency $= N = n = \sum_i f_i$
 Relative frequency: the ratio of the absolute frequency to the total frequency.
Relative frequency of a category $= \frac{f_i}{N}$
 Cumulative frequency: the running total of the frequencies.
Cumulative frequency $= \sum_{i=1}^{m} f_i$, where $m \le n$
 Relative cumulative frequency: the cumulative frequency divided by the total frequency.
Relative cumulative frequency = cumulative frequency / total frequency
 The final scores of the students:
84 49 61 40 83 67 45 66 70
69 80 58 68 60 67 72 73 70
57 63 70 78 52 67 53 67 75
61 70 81 76 79 75 76 58 95
Without any arrangement, the data are difficult to understand.
Create a table of the total frequency, relative frequency, ….
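One possible way to build such a table is sketched below in Python. It tabulates each distinct score with its frequency, relative frequency, cumulative frequency, and relative cumulative frequency, following the definitions given earlier. The variable names and the column layout are illustrative choices.

```python
from collections import Counter
from itertools import accumulate

scores = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
          68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
          53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 95]

freq = Counter(scores)          # f_i for each distinct value x_i
values = sorted(freq)
n = sum(freq.values())          # total frequency

# Running totals of the frequencies, in increasing order of x_i.
cum = list(accumulate(freq[x] for x in values))

print(f"{'x':>3} {'f':>3} {'rel f':>7} {'cum f':>6} {'rel cum f':>10}")
for x, c in zip(values, cum):
    f = freq[x]
    print(f"{x:>3} {f:>3} {f / n:>7.3f} {c:>6} {c / n:>10.3f}")
```

The last row always shows a cumulative frequency equal to the total frequency and a relative cumulative frequency of 1.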
Steps for constructing a stem-and-leaf display:
 Select one or more leading digits for the stem values. The trailing digits become the leaves.
 List possible stem values in a vertical column.
 Record the leaf for every observation beside the corresponding stem value.
 Indicate the units for stems and leaves someplace in the display.
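The steps above can be sketched in Python as follows. This is a minimal version that assumes two-digit whole-number data, takes the tens digit as the stem and the units digit as the leaf, and prints one row per stem (statistical packages may split each stem across two rows). The function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted list of leaves (units digit)."""
    plot = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)   # leading digit -> stem, trailing -> leaf
        plot[stem].append(leaf)
    return plot

scores = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
          68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
          53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 95]

plot = stem_and_leaf(scores)

# List every possible stem in a vertical column, even if it has no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")

# Indicate the units for stems and leaves.
print("Stem width: 10, each leaf: 1 case")
```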
 Suppose the salaries of the staff are:
120 135 209 162 165 216
216 181 222 150 175 167
130 190 155 197 182 215
187 172 144 199 215 210
145 169 170 225 177 205
Stem    Leaf         Frequency    Accumulated frequency
12      0            1            1
13      0,5          2            3
14      4,5          2            5
15      0,5          2            7
16      2,5,7,9      4            11
17      0,2,5,7      4            15
18      1,2,7        3            18
19      0,7,9        3            21
20      5,9          2            23
21      0,5,5,6,6    5            28
22      2,5          2            30
Total                30
Stem width: 10
Leaf: one case
Using SPSS, the stem-and-leaf display shows:

salary Stem-and-Leaf Plot

Frequency    Stem &  Leaf
     .00        1 .
    3.00        1 .  233
    4.00        1 .  4455
    8.00        1 .  66667777
    6.00        1 .  888999
    7.00        2 .  0011111
    2.00        2 .  22

Stem width:  100
Each leaf:   1 case(s)
 Class refers to a group of objects with some common property.
 Class boundary: given by the midpoint of the upper limit of one class and the lower limit of the next class.
 Class width = upper boundary - lower boundary
 Class midpoint (or class mark) = (lower limit + upper limit)/2
 Number of classes: generally given by $2^k \ge n \Rightarrow k \ge \log_2 n$, where
k = number of classes
n = number of observations
Create a frequency distribution of the student scores with classes of tens (1-10, 11-20, …)
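A possible sketch of this exercise in Python follows. It first computes the smallest k satisfying the rule of thumb for the number of classes, then groups the scores into classes of tens. The helper name `class_of` is illustrative. Note that fixing the class width at ten, as the exercise asks, need not produce exactly k classes.

```python
import math
from collections import Counter

scores = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
          68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
          53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 95]
n = len(scores)

# Smallest whole k with 2**k >= n, i.e. k >= log2(n).
k = math.ceil(math.log2(n))
print(f"n = {n}, suggested number of classes k = {k}")

def class_of(x):
    """Return the (lower limit, upper limit) of the class of tens holding x."""
    lower = ((x - 1) // 10) * 10 + 1
    return (lower, lower + 9)

freq = Counter(class_of(x) for x in scores)

print("class    f   rel f")
for lower, upper in sorted(freq):
    f = freq[(lower, upper)]
    print(f"{lower:>3}-{upper:<3} {f:>3}   {f / n:.3f}")
```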
 Consider data consisting of observations on a discrete variable x. The frequency of any particular x value is the number of times that value occurs in the data set.
 The relative frequency of a value is the fraction or proportion of times the value occurs.
 A frequency distribution and histogram can also be constructed when the data set is qualitative (categorical) in nature.
 Some classes have a natural ordering (e.g. BAC2, Bachelor, Master, Doctor); in other cases the order is arbitrary (e.g. Cambodian, English, American, French, Japanese…).
 Bar chart
[Figure: bar chart of Rice (T) by Year, 1999 to 2002; value axis from 0 to 40000]
 Pie chart
 Frequency polygon or line
[Figure: frequency polygon over the classes [20-30], ]30-40], ]40-50], ]50-60], ]60-70]; frequency axis from 0 to 70]
 Stock chart (Low-High-Close chart)
[Figure: High, Low, and Close series for the dates 9-Jun, 10, 11, 12, 13; value axis from 175 to 184]
 A survey of student ratings shows:
Rating        Frequency
A             478
B             893
C             680
D             178
F             100
Don’t know    172
 Construct a frequency distribution and histogram.
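As a rough sketch of this exercise, the Python snippet below turns the rating counts into relative frequencies and prints a crude text histogram. The bar scale (50 characters for the whole sample) is an arbitrary choice made only for display.

```python
# Rating counts taken from the survey table.
ratings = {"A": 478, "B": 893, "C": 680, "D": 178, "F": 100, "Don't know": 172}

total = sum(ratings.values())                       # total frequency
relative = {r: f / total for r, f in ratings.items()}

for r, f in ratings.items():
    bar = "#" * round(50 * relative[r])             # crude text bar
    print(f"{r:<10} {f:>4} {relative[r]:.3f} {bar}")
print(f"{'Total':<10} {total:>4}")
```

Because the categories are qualitative, the order of the bars follows the table rather than a numerical scale.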
 For two variables, we use a contingency table (counts, with proportions of the total in parentheses):

                Number of Flights per Year
Age             1-2         3-5         Over 5      Total
Less than 25    1 (0.02)    1 (0.02)    2 (0.04)    4 (0.08)
25-40           2 (0.04)    8 (0.16)    10 (0.20)   20 (0.40)
40-65           1 (0.02)    6 (0.12)    15 (0.30)   22 (0.44)
65 and Over     1 (0.02)    2 (0.04)    1 (0.02)    4 (0.08)
Total           5 (0.10)    17 (0.34)   28 (0.56)   50 (1.00)
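The row totals, column totals, and proportions in a contingency table can be checked with a short Python sketch like the one below. The cell counts are those of the table; the variable names and print layout are illustrative.

```python
ages = ["Less than 25", "25-40", "40-65", "65 and Over"]
flights = ["1-2", "3-5", "Over 5"]

# counts[i][j] = number of people in age group i with flight category j
counts = [
    [1, 1, 2],
    [2, 8, 10],
    [1, 6, 15],
    [1, 2, 1],
]

n = sum(sum(row) for row in counts)               # grand total
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

print(f"{'Age':<14}" + "".join(f"{c:>12}" for c in flights + ["Total"]))
for age, row, total in zip(ages, counts, row_totals):
    cells = [f"{v} ({v / n:.2f})" for v in row + [total]]
    print(f"{age:<14}" + "".join(f"{c:>12}" for c in cells))
cells = [f"{v} ({v / n:.2f})" for v in col_totals + [n]]
print(f"{'Total':<14}" + "".join(f"{c:>12}" for c in cells))
```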
The sample mean $\bar{x}$ of observations $x_1, x_2, \ldots, x_n$ is given by:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included so that every sample observation appears in the ordered list). Then $\tilde{x}$ equals:
 the $\left(\frac{n+1}{2}\right)$th ordered value (the single middle value) if n is odd;
 the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th ordered values (the average of the two middle values) if n is even.
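Both definitions translate directly into code. The sketch below, with illustrative function names, applies them to the 36 final scores from earlier in the lesson.

```python
def sample_mean(xs):
    """Sum of the observations divided by their number."""
    return sum(xs) / len(xs)

def sample_median(xs):
    """Middle ordered value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n+1)/2)th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2   # (n/2)th and (n/2 + 1)th values

scores = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
          68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
          53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 95]

print(f"mean   = {sample_mean(scores):.2f}")
print(f"median = {sample_median(scores)}")
```

Python lists are indexed from 0, which is why the code uses `mid` and `mid - 1` where the formulas count positions from 1.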
 Quartiles divide the data into four parts: first quartile, second quartile = median, third quartile.
 Trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$. A 10% trimmed mean would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what is left over.
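A trimmed mean can be sketched as below, using the 30 staff salaries from the stem-and-leaf example so that 10% is a whole number of observations (3 at each end). The function name is illustrative, and rounding the trimmed count down when n times the percentage is not a whole number is a simplifying assumption of this sketch, not a rule stated in the lesson.

```python
def trimmed_mean(xs, percent):
    """Drop the smallest and largest `percent`% of the sample, then average the rest."""
    ordered = sorted(xs)
    k = len(ordered) * percent // 100     # observations removed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [120, 135, 209, 162, 165, 216,
            216, 181, 222, 150, 175, 167,
            130, 190, 155, 197, 182, 215,
            187, 172, 144, 199, 215, 210,
            145, 169, 170, 225, 177, 205]

print(f"mean             = {trimmed_mean(salaries, 0):.2f}")
print(f"10% trimmed mean = {trimmed_mean(salaries, 10):.2f}")
```

With 0% trimming the function reduces to the ordinary sample mean.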
 The mean and median give only partial information about a data set or distribution.
 Different samples or populations may have identical measures of center yet differ from one another in other important ways.
 The simplest measure of variability in a sample is the range: the difference between the largest and the smallest sample values.
 Sample variance, denoted by $s^2$, is given by
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1}$$
 Sample standard deviation, denoted by $s$, is the (positive) square root of the variance:
$$s = \sqrt{s^2}$$
 Note that $\mu$, $\sigma^2$ and $\sigma$ are used for the population, and the divisor in the $\sigma^2$ calculation is n, not n-1.
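The difference between the two divisors can be seen in a short Python sketch. The function names are illustrative; the data are the 36 final scores from earlier in the lesson, treated once as a sample and once (purely for comparison) as a whole population.

```python
import math

def sample_variance(xs):
    """S_xx / (n - 1), where S_xx is the sum of squared deviations from the mean."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxx / (n - 1)

def population_variance(xs):
    """Same sum of squared deviations, but divided by n instead of n - 1."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

scores = [84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58,
          68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67,
          53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 95]

s2 = sample_variance(scores)
s = math.sqrt(s2)                         # sample standard deviation
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
print(f"variance with divisor n = {population_variance(scores):.2f}")
```

The divisor n - 1 always gives a slightly larger value than the divisor n for the same data.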
 A boxplot has been used successfully to describe several of a data set’s most prominent features:
 center
 spread
 the extent and nature of any departure from symmetry, and
 identification of outliers (observations that lie unusually far from the main body of the data).
 Example 1.17 gives the following pit depth data for the crude oil plate:
40 52 55 60 70 75 85 85 90 90 92 94
94 95 98 100 115 125 125
The five-number summary is as follows:
Smallest = 40, Lower fourth = 72.5, $\tilde{x}$ = 90, Upper fourth = 96.5, Largest = 125
[Figure: boxplot of Depth; axis from 40 to 140]
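The five-number summary for the pit depth data can be reproduced with the sketch below. It assumes one common definition of the fourths, which the lesson does not spell out: the lower and upper fourths are the medians of the smaller and larger halves of the ordered data, with the overall median included in both halves when n is odd. Under that assumption the code returns the values quoted in the example.

```python
from statistics import median

def five_number_summary(xs):
    """Return (smallest, lower fourth, median, upper fourth, largest)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2            # half size; includes the median when n is odd
    lower_fourth = median(ordered[:half])
    upper_fourth = median(ordered[n - half:])
    return ordered[0], lower_fourth, median(ordered), upper_fourth, ordered[-1]

depths = [40, 52, 55, 60, 70, 75, 85, 85, 90, 90, 92, 94,
          94, 95, 98, 100, 115, 125, 125]

print(five_number_summary(depths))   # (40, 72.5, 90, 96.5, 125)
```

The box of the boxplot runs from the lower fourth to the upper fourth, with a line at the median.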