CHAPTER 11. STATISTICAL ANALYSIS IN SCIENTIFIC RESEARCH

11.1 Basics of Statistics
Statistics is a branch of mathematics that deals with the collection,
organization, and analysis of numerical data and with such problems as experiment
design and decision making.
The term statistics has its origin in the Italian word statista (meaning "statesman"), but the direct ancestor of the modern term is the German word Statistik, first used by Gottfried Achenwall (1719-1772), a professor at Marburg and Gottingen. The word statistics itself was introduced into English by E. A. W. Zimmermann. However, even before the eighteenth century people were recording and using data. The popularity of statistics grew with Sir John Sinclair and his Statistical Account of Scotland, which covers the period 1791-99.
There are various statistical techniques that can be applied in every branch of public and private enterprise, but statisticians generally divide the field into two main parts: descriptive statistics and inferential statistics.
In short, in descriptive statistics there is no generalization from sample to population (the concepts of sample and population are discussed below). We simply describe a data set with tables, charts, or graphs, without implying any generalization to other data or to a population. In inferential statistics, on the other hand, there is a generalization from sample to population: the conclusions drawn go beyond the data at hand. Such a generalization may not be true and valid, so the statistician should specify how likely it is to be true, because it is based on estimation. Inferential statistics is also called statistical inference. Statistical inference is also applied in decision theory, a branch of statistics, because the two are closely related: decisions are made under conditions of uncertainty, and statistical inference is therefore very useful in decision making.
11.2 Arranging Data: Data Array, Frequency Distributions, and Cross-Tabulations
Data are collections of any number of related observations. We can, for example, collect information about the number of students at Eastern Mediterranean University (EMU) in the Turkish Republic of Northern Cyprus (TRNC) and divide them into categories such as nationality, gender, and age group. A collection of data is called a data set, and a single observation in the data is called a data point. People can gather data from past records or by observation, and data on the past can be used to make decisions about the future. So data play a very important role in decision making.
Most of the time it is not possible to gather data for the whole population, so statisticians gather data from a sample and use this information to make inferences about the population that the sample represents. A population is a whole, whereas a sample is only a fraction of the population. Assume that there are currently 10,300 students at EMU and we want to evaluate the expectations and opinions of EMU students toward the university. It would be very hard to consider every student in the university, so we select a fraction of the total number. If we decide to take 15% of the total, the selected number of students is 1,545; this number is called the sample size. The total number of students (10,300), on the other hand, is called the population size.
A sample or a population can be collected randomly or non-randomly. When data are selected randomly, all observations have an equal chance of being included in the data regardless of their characteristics. When data are not selected randomly, the selection is biased with respect to some characteristic of the observations.
In order to use data efficiently for any purpose, we need to arrange them, and this arrangement can take various forms. Data that have not yet been arranged and analyzed are called raw data; they are still unprocessed by statistical methods.
Data Array The first form of arranging data is the data array. It is one of the simplest ways to present data: it arranges the values in ascending or descending order.
Table 11.1 Grades of Students

Raw Data:           88   78   15   65   55   76   30   64   45  100   96   17   47   32   33   68
Ascending Array:    15   17   30   32   33   45   47   55   64   65   68   76   78   88   96  100
Descending Array:  100   96   88   78   76   68   65   64   55   47   45   33   32   30   17   15
When we use a data array, we can immediately see the lowest and highest values in the data, we can divide the data into sections, and we can see whether a value appears more than once. But when we have large quantities of data, a data array is not very helpful, and we need to arrange the data by another method.
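As an aside, a data array is trivial to produce with software. A minimal sketch in Python, using the grades of Table 11.1:

    # Ascending and descending data arrays for the raw grades of Table 11.1
    raw = [88, 78, 15, 65, 55, 76, 30, 64, 45, 100, 96, 17, 47, 32, 33, 68]
    print(sorted(raw))                   # ascending array
    print(sorted(raw, reverse=True))     # descending array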
Frequency Distributions The second form of arranging data is the frequency distribution. It is one of the best-known ways of presenting data in statistics. It divides the data into classes with lower and upper limits, and it shows the number of observations that fall into each class. We can also express the frequency of each class as a fraction or percentage of the total number of observations, which gives the relative frequency distribution. Table 11.2 shows the frequency distribution and relative frequency distribution.
Table 11.2 Frequency Distribution of Student Grades

Class        Frequency   Relative Frequency
0 - 25           2             0.13
26 - 50          5             0.31
51 - 75          4             0.25
76 - 100         5             0.31
Total           16             1.00
As you will notice from the table, the sum of the relative frequencies over all classes is equal to 1.00, or 100%. It can never exceed 1.00, because the relative frequencies are the result of dividing the frequency of each class by the total. The classes in a frequency distribution are all-inclusive: all the data fit into one class or another. And the classes are mutually exclusive: no data point can fall into more than one class.
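This bookkeeping is easy to automate. Below is a minimal sketch in Python that rebuilds the frequency and relative frequency distributions of Table 11.2 from the raw grades of Table 11.1 (the values agree with the table up to rounding):

    # Frequency and relative frequency distribution of the grades in Table 11.1
    grades = [88, 78, 15, 65, 55, 76, 30, 64, 45, 100, 96, 17, 47, 32, 33, 68]
    classes = [(0, 25), (26, 50), (51, 75), (76, 100)]   # (lower, upper) limits

    n = len(grades)
    for lower, upper in classes:
        freq = sum(lower <= g <= upper for g in grades)
        print(f"{lower:>3}-{upper:<3}  frequency: {freq:2d}  relative: {freq / n:.2f}")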
Frequency distributions can be qualitative or quantitative, open-ended or closed-ended, and discrete or continuous. We can classify data according to quantitative characteristics such as age group, salary, or income level, or according to qualitative characteristics such as sex, occupation, or nationality. On the other hand, we can arrange the data in open-ended or closed-ended classes: in an open-ended classification scheme the last class in the frequency distribution has no upper limit. Lastly, the classes in a frequency distribution can be discrete or continuous. Discrete data consist of separate values that do not progress from one class to another without a break (e.g. 1, 2, 5, 10, 100), whereas continuous data progress from one class to another without a break (e.g. 1.1, 1.2, 22.5, 110.56).
You can see various types of frequency distributions below:
Table 11.3 Types of Frequency Distribution Tables

(a) Quantitative and discrete data with an open-ended class

Income level ($)   Frequency   Relative Frequency
0 - 500               15             0.30
500 - 1000            25             0.50
1000 -                10             0.20
TOTAL                 50             1.00

(b) Qualitative data

Gender     Frequency   Relative Frequency
Male          20             0.40
Female        30             0.60
TOTAL         50             1.00

(c) Continuous data with closed-ended classes

Student GPAs   Frequency   Relative Frequency
1.00 - 1.99       100            0.20
2.00 - 2.99       250            0.50
3.00 - 4.00       150            0.30
TOTAL             500            1.00
Cross-Tabulations The third form of arranging data is the "cross-tabulation", a two-way table that represents data with two separate characteristics along its row and column dimensions. Consider Table 11.4 (a) for the distribution of income level with respect to gender: gender is on the row dimension and income level on the column dimension. Table 11.4 (b) shows the same two-way distribution of income level with respect to gender, both in absolute numbers and in relative frequencies (percentages).
Table 11.4 (a) Cross-Tabulation of Income Level with respect to Gender

                     Income Level ($)
Gender      0-500   500-1000   1000-    Row Total
Male          7         9        4         20
Female        8        16        6         30
Column
Total        15        25       10         50

Table 11.4 (b) Cross-Tabulation of Income Level with respect to Gender
(each cell: count, row %, column %, total %)

                     Income Level ($)
Gender      0-500   500-1000   1000-    Row Total
Male          7         9        4         20
             35.0      45.0     20.0       40.0
             46.7      36.0     40.0
             14.0      18.0      8.0
Female        8        16        6         30
             26.7      53.3     20.0       60.0
             53.3      64.0     60.0
             16.0      32.0     12.0
Column       15        25       10         50
Total        30.0      50.0     20.0      100.0
Interpretation of these two-way tables is essential in statistics, especially in scientific research and in decision making. On the basis of Table 11.4 (b), for example, the sample size is 50; 35% of the males in this sample have an income level between 0 and $500, which corresponds to 7 persons out of the total of 20 males. Likewise, 46.7 percent of those who have an income level between 0 and $500 are male, which corresponds to 7 persons out of the 15 persons in that income class. Lastly, 14% is the fraction of males with an income level between 0 and $500 out of the total sample size of 50. The total number of males (20) constitutes 40% of the sample (n = 50), and the total number of those with an income level between 0 and $500 constitutes 30% of the sample.
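Cross-tabulations like Table 11.4 can be produced directly from unit-record data. Below is a hedged sketch with pandas; the records are reconstructed here from the cell counts of Table 11.4 (a), and the column names are illustrative assumptions, not from the original study:

    import pandas as pd

    # Rebuild the 50 individual records from the cell counts of Table 11.4 (a)
    cells = {("Male", "0-500"): 7, ("Male", "500-1000"): 9, ("Male", "1000-"): 4,
             ("Female", "0-500"): 8, ("Female", "500-1000"): 16, ("Female", "1000-"): 6}
    rows = [(g, inc) for (g, inc), count in cells.items() for _ in range(count)]
    df = pd.DataFrame(rows, columns=["gender", "income"])

    # Absolute counts with row/column totals, as in Table 11.4 (a)
    print(pd.crosstab(df["gender"], df["income"], margins=True))

    # Row percentages (e.g. 35% of males earn $0-500), as in Table 11.4 (b)
    print(pd.crosstab(df["gender"], df["income"], normalize="index") * 100)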
For large amounts of data it is very hard and time-consuming to organize and arrange data with frequency distributions or cross-tabulations by hand. Nowadays, computer packages, especially SPSS (Statistical Package for the Social Sciences), make it very easy to create these types of tables. We will study these subjects in the following chapters.
11.3 Using Graphs to Describe Distributions
We can represent the distribution of a data set (especially a frequency distribution) in various forms of graphs. Graphs of distributions usually have two dimensions, X and Y: the values or characteristics of the variables are placed on the X-axis, and their frequencies, in absolute or relative terms, on the Y-axis. Graphs with relative frequencies are often more useful because they attract more attention from the reader and are easier to understand and to base decisions on. Nowadays, there are advanced computer packages that are effective for drawing graphs; we will discuss these in later chapters. Figure 11.1 includes a few examples of the types of graphs available in Microsoft Excel ’97 for Windows.
Figure 11.1 Types of Graphs: (a) column bar graph, (b) line graph, (c) pie charts, (d) XY (scatter) graphs
11.4 Measures of Central Tendency and Dispersion
After data have been collected and tabulated, analysis begins with the calculation of single numbers that summarize or represent all the data, called summary statistics. We use summary statistics to describe the characteristics of a data set, and nowadays almost every statistical package provides summary statistics in its output.
Two kinds of summary statistics are especially important for decision-makers: central tendency and dispersion. Before we get into the details of these two concepts, let us define them briefly.
Central Tendency Because data often exhibit a cluster around a central point, a number describing that point is called a measure of central tendency: it refers to the central or middle point of a distribution. Measures of central tendency are also known as measures of location. We can show the concept of central tendency in a graph:
Figure 11.2 Central tendency for three types of distribution (curves A, B, and C)
It is clear from the figure that the central locations of curves A and C are equal to each other, while the central location of curve B lies to the right of those of curves A and C.
Dispersion Dispersion refers to the spread of the data in a distribution. Notice in Figure 11.2 that curve B has a wider spread, or dispersion, than A and C, and curve C has a wider spread than curve A.
Besides central tendency and dispersion, an investigator may benefit from two other measures of a data set: skewness and kurtosis.
Skewness The curve of a distribution may be either symmetrical or skewed. In a symmetrical curve, the area is divided into two equal parts by a vertical line drawn from the peak of the curve to the horizontal axis. For example, we know that the total of a relative frequency distribution is equal to 1.00; in a symmetrical curve, 50% of the data lie on the left-hand side of the curve and the other 50% on the right-hand side.
Figure 11.3 Symmetrical curve (50% of the area on each side of the peak)
On the other hand, curves A and B in Figure 11.4 are skewed curves: their frequency distributions are concentrated at either the low end or the high end of the measuring scale on the horizontal axis. Curve A is said to be positively skewed, and curve B negatively skewed.
Figure 11.4 Positively and negatively skewed curves. Curve A: skewed to the right; Curve B: skewed to the left
Kurtosis Kurtosis is the peakedness of a distribution. Notice in Figure 11.5 that the two curves have the same central location and dispersion, and both are symmetrical, but curve A is said to be more peaked than curve B.
Figure 11.5 Measure of degree of kurtosis (curve A more peaked than curve B)
Measures of Central Tendency
In statistics, the arithmetic mean, the weighted mean, the geometric mean, the median, and the mode are referred to as the measures of central tendency. We will consider the arithmetic mean first.
The Arithmetic Mean
The arithmetic mean is the simple average of a data set. We can calculate the average age in a class, the average monthly expenditure of students at EMU, the average number of tourists coming to TRNC each year, and so on.
The arithmetic mean of a population is represented by the symbol µ, and that of a sample by x̄. The formulas for µ and x̄ are:

Population: µ = ∑X / N,   where N represents the population size
Sample:     x̄ = ∑x / n,   where n represents the sample size
Table 11.4 provides the ages of students in a class. In this case, we assume that the data represent a sample drawn from the whole university.
Table 11.4 Ages of Students in a Class

ID     Name         Age
1.     Ali           25
2.     Veli          24
3.     Ayla          23
4.     George        24
5.     Mohammed      22
6.     Asher         26
7.     Samah         25
8.     Ayse          27
9.     Mahmut        26
10.    John          28
Now, let's calculate the arithmetic mean for this ungrouped data:

x̄ = ∑x / n = (25 + 24 + 23 + 24 + 22 + 26 + 25 + 27 + 26 + 28) / 10 = 250 / 10

So the arithmetic mean of the ages in the class is x̄ = 25.
But what if the data are grouped? In grouped data, we do not know the separate values of each observation, so we are only able to estimate the mean. In ungrouped data, by contrast, we know all the observations, so whatever mean we find from the data is the actual value.
To calculate the arithmetic mean of grouped data we use the following formula:

x̄ = ∑(f × x) / n

where
• x̄ = sample mean
• ∑ = summation
• f = number of observations in each class
• x = midpoint of each class
• n = sample size
Let's look at the following frequency distribution of student GPAs, which is at the same time a grouped data set.
Table 11.5 Frequency Distribution of Student GPAs

Student GPAs   Frequency   Relative Frequency
1.00 - 1.99       100            0.20
2.00 - 2.99       250            0.50
3.00 - 4.00       150            0.30
TOTAL             500            1.00
The first step in calculating the arithmetic mean is to find the midpoint (x) corresponding to each class. To find a midpoint, we add the lower limit of the class to the lower limit of the following class and divide by two. For example, the midpoint of the first class is (1.00 + 2.00)/2 = 1.5. This process continues until we reach the last class interval. Then we multiply each midpoint by the corresponding absolute frequency and add the products up. Lastly, we divide this sum by the total number of observations in the data. This exercise is shown in Table 11.6.
Table 11.6 Arithmetic Mean for Student GPAs

Student GPAs   Frequency   Midpoint (x)    f × x
1.00 - 1.99       100        × 1.50          150
2.00 - 2.99       250        × 2.50          625
3.00 - 4.00       150        × 3.50          525
TOTAL             500                      1,300

x̄ = ∑(f × x) / n = 1,300 / 500 = 2.6
So our approximated, or estimated, mean for the student GPAs from the grouped data is 2.6. A useful practice is to round the midpoints to convenient whole numbers (or whole cents, for money data) for easier calculation.
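A minimal sketch of this grouped-mean computation in Python, using the frequencies and midpoints of Table 11.6:

    # Estimated mean from grouped data: sum of frequency x midpoint, over n
    freqs = [100, 250, 150]          # class frequencies from Table 11.6
    midpoints = [1.5, 2.5, 3.5]      # class midpoints (rounded as in the text)

    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    print(mean)                      # 2.6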
Today, statistical packages produce these frequency distributions ready-made, and computers calculate the arithmetic mean from the original data, so the grouped-data formula is often unnecessary in practice.
The arithmetic mean is the best known and most frequently used measure of central tendency. One of its most important uses is that it allows easy comparison between different data sets.
The arithmetic mean has two important disadvantages. Firstly, it is affected by extreme values. Secondly, it cannot be calculated for grouped data with open-ended classes.
The Median
In its simplest meaning, the median divides a distribution into two equal parts. It is a single number representing the most central, or middlemost, item in the data: half of the data lie below this number, and the other half lie above it.
In order to calculate the median of ungrouped data, we first array the data in ascending or descending order. If we have an odd number of observations, the median is the most central item in the data. Let's consider the following simple data in Table 11.7:
Table 11.7 Graduated Students in Each Year

Year    No of Students
1991         10
1992         15
1993         13
1994         14
1995         17
Firstly, let's array the data in ascending order:

10,  13,  14,  15,  17

In this case, the most central item of this odd-numbered data set is 14, which is at the same time the median of the data set.
Another way of finding the median is to use the following formula:

Median = the ((n + 1) / 2)th item in the data array

where n represents the number of items in the data. If we apply this formula to the above data:

Median = the ((5 + 1) / 2 = 3)rd item in the data, which corresponds to 14.

This formula is particularly useful for even-numbered data, where it takes the average of the two middle items in the data.
In order to calculate the median of even-numbered data, we need to take the average of the two middlemost items, since there is no single most central item in the data set. So we use the formula above. Now let's extend Table 11.7 to 1996 and try to calculate the median; in this case the number of observations is 6 (1991-1996).
Table 11.7 Graduated Students in Each Year

Year    No of Students
1991         10
1992         15
1993         13
1994         14
1995         17
1996         21
Again we have to sort the data in ascending order:

10,  13,  14,  15,  17,  21
 6 +1
From the formula, median is 
 = 3.5 th item in the data which is
 2 
 14 + 15 
included between 14 and 15. And the average of 14 and 15 is 
 = 14.5 . That
 2 
is the median of this data set. So the median number of graduated students for the
period of 1991-96 is 14.5.
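Both cases are handled by Python's statistics module, which applies exactly this rule (middle item for odd n, average of the two middle items for even n):

    import statistics

    print(statistics.median([10, 13, 14, 15, 17]))       # odd n: 14
    print(statistics.median([10, 13, 14, 15, 17, 21]))   # even n: 14.5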
For grouped data, we have to find an estimated value for the median, which falls within some class interval: we do not know the individual observations, only the frequency distribution with its class intervals. The formula for the median of grouped data is given below:
m̃ = L + [ ((n + 1)/2 − (F + 1)) / f ] × w

where
• m̃ = the median estimated from the sample distribution
• L = the lower limit of the class interval containing the median
• F = the cumulative sum of the frequencies up to, but not including, the median class
• f = the frequency of the class interval containing the median
• w = the width of the class interval containing the median
• n = total number of observations in the data

In cases where we work with the population, m̃ is replaced by Md and n by N.
Let's consider Table 11.5 from the previous examples and try to find the median for these data:

Table 11.8 Finding the Median for Student GPAs

Student GPAs   Frequency
1.00 - 1.99       100
2.00 - 2.99       250   ← Median class
3.00 - 4.00       150
TOTAL             500
The first step is to find the class interval that contains the median. The median is the ((500 + 1)/2 = 250.5)th item in the data. Secondly, we have to find the class interval in which the 250.5th item is included. To do that, we add the frequencies together from the beginning until the cumulative sum reaches 250.5, and then we stop. In these data the median falls into the class (2.00-2.99), because 100 + 250 = 350 and we have already passed 250.5. So the median class is (2.00-2.99).
Now we put the values into the formula:

m̃ = 2.00 + [ ((500 + 1)/2 − (100 + 1)) / 250 ] × 1.00 = 2.598
So the median value of the student GPAs is ≈ 2.60. It is an estimated sample median, since the data are grouped.
Unlike the mean, the median is not affected by extreme values in the data. It can be calculated even for open-ended grouped data, unless the median falls into the open-ended class.
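Below is a hedged sketch of the grouped-data median formula above; the function name and argument layout are illustrative, not from the text:

    def grouped_median(lowers, freqs, widths):
        """Estimate the median from class lower limits, frequencies, and widths."""
        n = sum(freqs)
        target = (n + 1) / 2               # position of the median item
        cum = 0                            # cumulative frequency F
        for L, f, w in zip(lowers, freqs, widths):
            if cum + f >= target:          # the median falls in this class
                return L + ((n + 1) / 2 - (cum + 1)) / f * w
            cum += f

    # Table 11.8: GPA classes 1.00-1.99, 2.00-2.99, 3.00-4.00
    print(grouped_median([1.00, 2.00, 3.00], [100, 250, 150], [1, 1, 1]))  # 2.598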
The Mode
The mode is the value or observation that occurs most frequently in the data. If two or more distinct observations occur with equal frequencies, and none with greater frequency, the set of observations may be said to have no single mode, or to be bimodal (with modes at the two most frequent observations) or trimodal (with modes at the three most frequent observations). When a single value occurs most frequently, the distribution is unimodal.
In order to find the mode of ungrouped data, we again array the data in ascending or descending order. Let's consider the following ungrouped data, which represent the final exam marks of 35 students in a class.
Table 11.9 Student Marks in the Final Exam (arrayed in ascending order)

10   21   35   60   79   89   96
12   23   42   65   81   90   97
16   23   48   67   83   91   98
19   23   50   68   85   93   99
20   30   56   76   87   94   99
It is clear that the most frequently repeated observation, or student mark, is 23: it occurs 3 times, so the mode of these ungrouped data is 23 and the distribution is unimodal. (As we can observe from the data, 99 occurs only 2 times.) Now let's consider the following table of student marks again:
Table 11.10 Student Marks in the Final Exam (arrayed in ascending order)

10   21   35   60   79   89   96
12   23   42   65   81   90   97
16   23   48   67   83   91   98
19   23   50   68   83   93   99
20   30   56   76   83   94   99
This time we have changed the observations, and there are two values that occur most frequently, 23 and 83, each repeated 3 times. The modes of these data are 23 and 83, and the distribution is called bimodal.
Lastly, if three observations in a data set share the highest frequency, the distribution is trimodal. Let's make one more change to the previous table:
Table 11.11 Student Marks in the Final Exam (arrayed in ascending order)

10   21   35   60   79   90   96
12   23   42   65   81   90   97
16   23   48   67   83   90   98
19   23   50   68   83   93   99
20   30   56   76   83   94  100
This time we have three observations that occur most frequently: 23, 83, and 90, each repeated 3 times. However, a generally accepted convention is that when two or more observations in a distribution share the highest frequency, the distribution is simply called bimodal.
When we have grouped data, we assume that the mode is located in the class interval with the highest frequency, called the modal class. To find the mode of grouped data, we use the following formula:

M0 = LM0 + [ (fm − fb) / ((fm − fb) + (fm − fa)) ] × w

where M0 = the mode of the frequency distribution or grouped data
• LM0 = lower limit of the modal class
• fm = frequency of the modal class
• fb = frequency of the class interval below the modal class
• fa = frequency of the class interval above the modal class
• w = the width of the modal class
Let's apply this formula to find the mode of the following frequency distribution of student GPAs:

Table 11.12 Finding the Mode for Student GPAs

Student GPAs   Frequency
1.00 - 1.99       100
2.00 - 2.99       250   ← Modal class
3.00 - 4.00       150
TOTAL             500
As we can see from the table, the modal class of this frequency distribution is 2.00-2.99, since it has the highest frequency. Now we put the values into the formula:

M0 = 2.00 + [ (250 − 100) / ((250 − 100) + (250 − 150)) ] × 1.00 = 2.60

So the mode of this frequency distribution is 2.60. Since these are grouped data and we do not know every observation, 2.60 is an estimate of the mode.
Like the median, and unlike the mean, the mode is not affected by extreme values in the data, and we can use it even with open-ended class intervals.
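A minimal sketch of the grouped-data mode formula, with variable names following the text (fm, fb, fa, w); the function itself is illustrative:

    def grouped_mode(lower, fm, fb, fa, w):
        # lower limit of the modal class plus a weighted fraction of its width
        return lower + (fm - fb) / ((fm - fb) + (fm - fa)) * w

    # Modal class 2.00-2.99: fm = 250, class below fb = 100, class above fa = 150
    print(grouped_mode(2.00, 250, 100, 150, 1.00))   # 2.6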
Comparison of the Mean, the Median, and the Mode
Among these three measures of central tendency, the mean is the most popular and most usable one, and the mean and the median are generally preferred to the mode. Often the data contain no mode at all, because no value occurs more than once. Still, how frequently each of the three measures is used depends on the conditions and the area of research in which it is applied.
On the other hand, we can compare these measures of central tendency with respect to their positions in a distribution. When a distribution is symmetrical, the mean, the median, and the mode are all equal. Figure 11.6 shows this relationship:
Figure 11.6 Mean, median, and mode in a symmetrical distribution (all three coincide)
In this case there is no reason to prefer one measure over another, since they are equal to each other. But what about a skewed distribution? Figure 11.7 shows the positions of the three measures of central tendency when the distribution is skewed to the right and to the left:
Figure 11.7 Mean, median, and mode in skewed distributions. Curve A, skewed to the right: mode, median, mean in increasing order; Curve B, skewed to the left: mean, median, mode in increasing order
When the distribution is skewed, the median is the preferable measure of central tendency, because it lies between the mean and the mode in both positively and negatively skewed distributions.
Measures of Dispersion
When we compare two or more distributions using only measures of central tendency, we may not be satisfied: we need more information about these distributions. Knowing the means of two data sets, for example, may not be enough to compare them; we may need to know which distribution is more consistent than the other. The variability, or dispersion, is a useful measure here, since it measures the spread of the observations in the data around their mean. As the dispersion of the data decreases, the consistency and reliability of the data increase, and the central location (mean, median, or mode) becomes more representative of the data as a whole.
The concept of dispersion also plays an important role in business life. For example, a financial manager may be concerned with the earnings of firms: widely dispersed earnings indicate higher risk, because earnings that vary widely around their mean are inconsistent.
Figure 11.8 shows the spread of three curves that have the same mean. Although they have the same central location, curve A has the least spread and curve C the widest. So the distribution of curve A is said to be the most consistent and reliable of the three.
Figure 11.8 Measure of dispersion for three curves (A, B, C) having the same mean
Range, Interfractile Range and Interquartile Range
These are distance measures of dispersion. The range is the difference between the highest and the lowest values in a data set:

Range = Highest value − Lowest value

The interfractile range is the difference between two fractiles. The commonly used fractiles are:

Third fractiles = divide the data into 3 equal parts
Quartiles       = divide the data into 4 equal parts
Deciles         = divide the data into 10 equal parts
Percentiles     = divide the data into 100 equal parts
Let's consider the following data on student grades:

52   35   48   46   43   40   61   49   57   58   65   46
72   69   38   37   55   52   50   31   41   60   45   41
55   38   51   49   46   43   64   52   60   61   68   49
69   66   35   34   52   49   47   28   38   57   42   38
As a first example, let's divide the data into thirds and find the interfractile range between the 1/3 and 2/3 fractiles. How do we organize the data into three equal parts? We first order the data from the lowest to the highest and then split them. Since the sample size, n, is 48, we divide 48 by 3 and get 16, which means the sorted data can be arranged in a 3 × 16 layout:
       1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
1     28  31  34  35  35  37  38  38  38  38  40  41  41  42  43  43
2     45  45  46  46  47  48  49  49  49  49  50  51  52  52  52  52
3     55  55  57  57  58  60  60  61  61  64  65  66  68  69  69  72

So the 1/3 fractile = 43, the 2/3 fractile = 52, and the 3/3 fractile = 72.
The interfractile range between the 1/3 and 2/3 fractiles is then 52 − 43 = 9.
As a second example, what is the interfractile range between the 30th and 70th percentiles? The 30th percentile is 30% of 48 = 14.4 ≈ the 14th element in the data, and the 70th percentile is 70% of 48 = 33.6 ≈ the 34th element. The 14th element is 42 and the 34th element is 55, so the interfractile range is 55 − 42 = 13.
As a third example, let's find the interquartile range, which is the difference between the first and third quartiles; quartiles divide the data into 4 equal parts. Since 48/4 = 12, the sorted data can be arranged in a 4 × 12 layout:

       1   2   3   4   5   6   7   8   9  10  11  12
1     28  31  34  35  35  37  38  38  38  38  40  41
2     41  42  43  43  45  45  46  46  47  48  49  49
3     49  49  50  51  52  52  52  52  55  55  57  57
4     58  60  60  61  61  64  65  66  68  69  69  72

So:
1st quartile = 1/4 = 41
2nd quartile = 2/4 = 49
3rd quartile = 3/4 = 57
4th quartile = 4/4 = 72

And: Interquartile Range = Q3 − Q1 = 57 − 41 = 16
If, instead, we want the range between the 1/4 and 2/4 fractiles, it is 49 − 41 = 8. And the 30th and 70th percentiles are the same values as in the previous example, 42 and 55, with range 55 − 42 = 13: no matter whether we arrange the data into 3, 4, or any other number of equal parts, the percentiles keep the same values.
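These distance measures can be checked with NumPy, with one caveat: np.percentile interpolates between observations, so its quartiles need not coincide exactly with the simple end-of-each-quarter convention used above:

    import numpy as np

    grades = np.array([
        28, 31, 34, 35, 35, 37, 38, 38, 38, 38, 40, 41,
        41, 42, 43, 43, 45, 45, 46, 46, 47, 48, 49, 49,
        49, 49, 50, 51, 52, 52, 52, 52, 55, 55, 57, 57,
        58, 60, 60, 61, 61, 64, 65, 66, 68, 69, 69, 72])

    print(grades.max() - grades.min())        # range: 72 - 28 = 44
    q1, q3 = np.percentile(grades, [25, 75])
    print(q3 - q1)                            # interquartile range, about 16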
Variance and Standard Deviation
Variance and, especially, standard deviation are the most commonly used statistical measures of dispersion. They specify the average distance of an observation in a data set from the mean of the data. We might first try to measure the average deviation of the observations from the mean with the following formula:

Average deviation = ∑(Xi − µ) / N

where
• Xi = observations in the population
• µ = population mean
• N = population size

But when we use this formula, we find that the sum of the deviations, and therefore the average deviation, is always equal to zero. To prevent this problem, we square each deviation, which leads to the variance and the standard deviation.
The standard deviation is the square root of the variance. It is more widely used than the variance in statistical analyses, because the variance does not express the average dispersion in the original units but in squared units; taking the square root of the variance transforms it back, so the standard deviation measures the average dispersion of the data in the original units of measurement. The variance and the standard deviation of a population are given by:

Variance:             σ² = ∑(Xi − µ)² / N

Standard deviation:   σ = √σ² = √( ∑(Xi − µ)² / N )
However, most of the time it is not possible to know all the observations in the population, so we adapt the population formula to sampling units. To calculate the standard deviation of a given sample, we use:

s = √( ∑(xi − x̄)² / (n − 1) )

where
• xi = each sample unit in the distribution
• x̄ = sample mean
• n − 1 = sample size minus 1
What is the reason for using n − 1? It can be shown that if we select many different samples from a population, compute the standard deviation of each sample with n in the denominator, and take the average of these values, this average does not tend to equal the population standard deviation. In order to prevent this bias, we use n − 1 as the denominator.
Now let's calculate the standard deviation of student CGPAs for a randomly selected sample of 15 students.
Table 11.13 Calculating the Variance and Standard Deviation for an Ungrouped Sample of Student CGPAs

CGPA      x − x̄     (x − x̄)²
3.45       0.24       0.060
3.55       0.34       0.118
3.01      -0.20       0.038
3.56       0.35       0.125
3.84       0.63       0.402
2.99      -0.22       0.047
2.78      -0.43       0.181
3.33       0.12       0.015
3.52       0.31       0.099
3.25       0.04       0.002
3.02      -0.19       0.035
3.11      -0.10       0.009
3.16      -0.05       0.002
3.43       0.22       0.050
2.09      -1.12       1.245
Sum:       0.00       2.43

x̄ = 3.21

s = √( 2.43 / (15 − 1) ) = √0.17357 = 0.41655   (sample standard deviation)
The standard deviation (s) of this sample of student CGPAs is approximately 0.42, showing that, on average, each observation in the sample deviates from the mean (x̄ = 3.21) by 0.42, downwards or upwards. The variance of this sample (s²) is 0.17. As you can observe from the table, the sum of the raw deviations of the observations is equal to zero, which is why we square each deviation before adding them up.
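A minimal sketch of the same computation with Python's statistics module, which divides by n − 1 exactly as in the sample formula above:

    import statistics

    cgpa = [3.45, 3.55, 3.01, 3.56, 3.84, 2.99, 2.78, 3.33,
            3.52, 3.25, 3.02, 3.11, 3.16, 3.43, 2.09]

    print(statistics.mean(cgpa))       # about 3.21
    print(statistics.variance(cgpa))   # s^2, about 0.17
    print(statistics.stdev(cgpa))      # s, about 0.42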
Calculating Variance and Standard Deviation by Using Grouped Data
Up to this point we have discussed the variance and the standard deviation of ungrouped data, that is, of unprocessed raw data. But what if the data are grouped? Then we need a different formula. Since the standard deviation (σ) is the square root of the variance (σ²), we will work directly with the standard deviation. The formulas for grouped data are:

σ = √σ² = √( ∑f(xi − µ)² / N )        for a population

s = √s² = √( ∑f(xi − x̄)² / (n − 1) )   for a sample

Here xi represents the midpoint of each class interval, and f the frequency of each class.
Table 11.14 Standard Deviation for Grouped Data of Student GPAs

Student GPAs   Frequency   Midpoint (x)   xi − x̄   (xi − x̄)²   f·(xi − x̄)²
1.00 - 1.99        0           1.50        -1.80      3.24          0
2.00 - 2.99        3           2.50        -0.80      0.64          1.92
3.00 - 4.00       12           3.50         0.20      0.04          0.48
Total             15                                                2.40

x̄ = ∑(f · x) / n = ((0 × 1.5) + (3 × 2.5) + (12 × 3.5)) / 15 = 49.5 / 15 = 3.30

s = √( 2.40 / (15 − 1) ) = √0.17 = 0.414
An important point deserves mention here. Since we do not know every single observation in grouped data, we use the class midpoints as approximations of the real observations: we multiply each squared deviation of a midpoint from the mean by the corresponding frequency, add the products, and divide by N (for a population) or n − 1 (for a sample). So a standard deviation or variance computed from grouped data is an approximated, or estimated, value. For ungrouped data, by contrast, we know every single observation, so whatever we calculate is a real value.
A Relative Measure of Dispersion: The Coefficient of Variation
The standard deviation and the variance are absolute measures of dispersion. The coefficient of variation (CV), on the other hand, is a relative measure of dispersion that expresses the standard deviation as a percentage of the mean. Using the CV, we can easily compare the dispersions of two data sets in percentage terms. The formulas for the CV are:

CV = (σ / µ) × 100      for a population

CV = (s / x̄) × 100      for a sample
Let's consider the following example to better understand the use of the CV. Suppose that the common stock of Sabanci Inc. sold at an average of $50,000 per share with a standard deviation of $5,000 over the period 1990-1996, while Koc Inc. sold its common stock at an average of $60,000 per share with a standard deviation of $5,800 over the same period. The CVs of the two firms are:
CV(Sabanci) = (s / x̄) × 100 = (5,000 / 50,000) × 100 = 10%

CV(Koc)     = (s / x̄) × 100 = (5,800 / 60,000) × 100 = 9.66%
On the basis of these results, although Sabanci Inc. has less absolute variation in its common stock (standard deviation s = $5,000) than Koc Inc., it has more relative variation. This is because of the significant difference in their means.
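A minimal sketch of this comparison:

    def cv(std, mean):
        return std / mean * 100      # standard deviation as a percentage of the mean

    print(cv(5_000, 50_000))         # Sabanci: 10.0
    print(cv(5_800, 60_000))         # Koc: about 9.67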
11.5 Statistical Inference: Estimation and Hypothesis Tests
Statistical inference, estimation, and hypothesis testing are three important and closely related concepts in statistics. Statistical inference, defined in Section 11.1, deals with uncertainty by using probability concepts in decision making; it is based on estimation and comprises both estimation and hypothesis testing. We will start with estimation.
Estimation
When we deal with uncertainty, we have to make estimates. In statistics, we use sample statistics to estimate population parameters. Generally, there are two types of estimates: a point estimate and an interval estimate. A point estimate is a single value, a sample statistic, used to estimate an unknown population parameter; being a single number, it does not provide much information by itself. An interval estimate, on the other hand, is a range of values within which the population parameter is expected to fall.
The sample statistics used to estimate population parameters are called estimators. For example, x̄, the sample mean, is the estimator of the population mean µ, and s, the sample standard deviation, is the estimator of the population standard deviation σ. The observed values of the estimators are called estimates: if x̄ = 23, then x̄ is the estimator and 23 is the estimate of the true population mean.
An Alternative Way For Hypothesis Tests: Using Prob Values (p-values)
Recall that α is the predetermined significance level, the probability of rejecting a true null hypothesis, called a type I error. Selecting the level of α is up to the researcher, and the generally accepted rule is to consider the trade-off between α and β (the probability of a type II error). If the cost of making a type I error is relatively high for the researcher, he or she will want to avoid type I errors and will select a low level of α. Conversely, if the cost of making a type II error is relatively high, he or she will want to avoid type II errors and will select a high level of α.
On the other hand, the probability, computed from the standardized test statistic, of obtaining a result at least as extreme as the observed one when the null hypothesis is true is called the prob value (p-value). It is found directly from the z-table by using the z formula. Let's consider the following example:

Ho: µ = 15
Ha: µ ≠ 15

with σ = 2.1, n = 20, and x̄ = 13.6.
In this simple example, we have a two-tailed hypothesis on whether the mean of the population is equal to 15 or not. We are also given the population standard deviation, the sample size selected for the test, and the sample mean. The probability of observing a sample mean this far from 15, in either direction, is the prob value, so the prob value is the sum of the probabilities in both rejection tails. Let's find the prob value now.
Firstly, we have to find the standard error of the mean:

σx̄ = σ / √n = 2.1 / √20 = 0.47

The next step is to find the z score for x̄:

z = (x̄ − µ) / σx̄ = (13.6 − 15) / 0.47 = −2.98
Figure 11.9 Prob values in the normal curve: areas of 0.0014 in each tail beyond z = −2.98 and z = +2.98, and 0.4986 between 0 and each of these z scores
In this example, the p-value for the test is 2 × 0.0014 = 0.0028, so the standardized probability of accepting the alternative hypothesis is 0.28%.
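A hedged sketch of this two-tailed prob-value computation, using scipy.stats for the normal tail areas instead of a printed z-table:

    from math import sqrt
    from scipy.stats import norm

    mu0, sigma, n, xbar = 15, 2.1, 20, 13.6

    se = sigma / sqrt(n)          # standard error of the mean, about 0.47
    z = (xbar - mu0) / se         # about -2.98
    p = 2 * norm.sf(abs(z))       # two-tailed p-value, about 0.0029
    print(z, p)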
Now let's continue the test, selecting a significance level of α = 0.05. Figure 11.10 shows how α and the p-value are used together to test the hypothesis.
Figure 11.10 Use of prob values in testing the hypothesis: the z critical values ±1.96 bound the acceptance region at α = 0.05; the computed z = −2.98 falls in the rejection region
As you can see from Figure 11.10, the computed z score falls outside the region bounded by the z critical values, so we reject Ho and accept Ha: the true mean of the population is not equal to 15.
One more conclusion can be drawn from the above discussion. The p-value for the example is 0.0028, which is lower than α = 0.05. In general:

p-value > α   →   we accept Ho
p-value < α   →   we reject Ho and accept Ha

This rule holds not only for two-tailed tests but also for one-tailed tests.
11.6 Chi-Square and Analysis of Variance (ANOVA) Tests
Chi-square and ANOVA tests are two statistical techniques used in hypothesis testing. Usually, we use the chi-square test as a test of independence between two or more variables and of the goodness of fit of a particular probability distribution, and ANOVA as a test of difference between two or more population means. Let's consider these tests in more detail.
Chi-Square Test for Independence
Two-way tables (cross-tabulations) play an important role in carrying out and evaluating the chi-square test. If we conduct a chi-square hypothesis test in SPSS, the output provides the chi-square statistic, the degrees of freedom (df), and the significance level together with the table. To carry out a chi-square test we first need the computed value of the chi-square statistic (χ²):

χ² = ∑ (f0 − fe)² / fe

where
• χ² : chi-square statistic
• f0 : observed frequency in the distribution
• fe : expected frequency in the distribution
But how do we find fe? The following formula is used to calculate it:

fe = (rt × ct) / n

where
• rt : row total of the corresponding frequency cell
• ct : column total of the corresponding frequency cell
• n : total number of observations (sample size)
Secondly, we need to determine a significance level for the hypothesis test (recall that the significance level is the level of error accepted in the test). This might be 0.05 or 0.10; the level is up to the researcher.
Lastly, we need to find the table value of the chi-square statistic. To do that, we compute the degrees of freedom (df):

df = (r − 1) × (c − 1)

where
• df : degrees of freedom
• r : number of rows in the table
• c : number of columns in the table
We can then find the table value of the chi-square statistic from the chi-square distribution table by looking up the df and the significance level. If the null hypothesis is true, the sampling distribution of χ² can be approximated by a continuous curve known as the chi-square distribution. There is a different chi-square distribution for each level of df: the degrees of freedom increase as the column and/or row dimensions, and/or the number of variables in the test, increase. As df increases, the chi-square distribution becomes more symmetrical; with small df it is skewed to the right, as you can observe in Figure 11.11.
Figure 11.11 Chi-square distributions for 1, 5, and 10 degrees of freedom
Carrying Out a Hypothesis Test by Using Chi-Square
Figure 11.12 is a representative graph of the chi-square distribution as used in hypothesis testing. The shaded area in the right tail is the significance level, the level of error accepted for a true null hypothesis, i.e. the probability of rejecting a true null hypothesis. The left-hand side contains the confidence level for the null hypothesis and shows the probability of accepting a true null hypothesis.
Figure 11.12 Representative graph of the χ² distribution for a hypothesis test: acceptance region (C.L. = 0.90) to the left of the table value of χ², rejection region (α = 0.10) to the right
The intersection point of the acceptance and rejection regions corresponds to the table value of the chi-square statistic. If the computed value of the chi-square statistic falls into the acceptance region (i.e. if the computed value is less than the table value), the null hypothesis is accepted; otherwise it is rejected and the alternative hypothesis is accepted.
To understand this better, let's solve a problem on the chi-square test of independence. In line with the aim of this book, we will mostly work with computer-based output in these types of problems; the reader can refer to any statistics book for the theoretical computation of the formulas. Table 11.4 shows the evaluation of the teaching ability of lecturers by faculty.
The frequency in bold characters in each cell of Table 11.4 represents the expected frequency corresponding to each observed frequency. Recall from the fe formula that rt in Table 11.4 is equal to 10 for the 1st, 2nd, and 3rd rows and 20 for the 4th row; ct is equal to 4 for the 1st column, 18 for the 2nd column, 23 for the 3rd column, 5 for the 4th column, and 0 for the 5th column.
Each row or column total gives the proportion of the corresponding row or column variable in the total number of observations. For example, rt for the 1st row is 10, the total number of B&E students out of n (= 50); its proportion of n is 0.20 (20%), which is 10/50. ct for the 2nd column is 18, the total number of students who rated the teaching ability of lecturers as High; its proportion of n is 0.36 (36%), which is 18/50. Combining rt and ct gives the expected frequency of each cell, fe = (rt × ct) / n.
Now let's continue with our exercise.
Table 11.4 Evaluation of teaching ability of lecturers by faculty
(each cell: observed count with expected frequency in parentheses;
below: row %, column %, total %)

Faculty    Very High    High        Medium      Poor       Very Poor   Row Total
B&E        1 (0.8)      3 (3.6)     6 (4.6)     0 (1.0)    0 (0.0)        10
           10.0         30.0        60.0         0.0        0.0           20.0
           25.0         16.7        26.1         0.0         -
            2.0          6.0        12.0         0.0        0.0
A&S        0 (0.8)      5 (3.6)     5 (4.6)     0 (1.0)    0 (0.0)        10
            0.0         50.0        50.0         0.0        0.0           20.0
            0.0         27.8        21.7         0.0         -
            0.0         10.0        10.0         0.0        0.0
ENG        0 (0.8)      0 (3.6)     6 (4.6)     4 (1.0)    0 (0.0)        10
            0.0          0.0        60.0        40.0        0.0           20.0
            0.0          0.0        26.1        80.0         -
            0.0          0.0        12.0         8.0        0.0
OTHER      3 (1.6)     10 (7.2)     6 (9.2)     1 (2.0)    0 (0.0)        20
           15.0         50.0        30.0         5.0        0.0           40.0
           75.0         55.6        26.1        20.0         -
            6.0         20.0        12.0         2.0        0.0
Column     4           18          23           5          0              50
Total      8.0         36.0        46.0        10.0        0.0           100.0

Computed value of Pearson's chi-square statistic (χ²): 21.70833    df: 12    Significance level: ≈ 0.04
The null and alternative hypotheses for the chi-square test of this exercise are:

Ho: Teaching ability of lecturers is independent of faculty
Ha: Teaching ability of lecturers depends on faculty

In the chi-square test, the null hypothesis specifies independence and the alternative hypothesis specifies dependence. The computed value of the chi-square statistic is 21.70833, called Pearson's chi-square statistic; the degrees of freedom are (4 − 1)(5 − 1) = 12, and the corresponding significance level (p-value) is approximately 0.04.
Let's test the hypothesis at the 0.01 level of significance (α = 0.01). The next step is to find the table value of χ². Appendix table … provides the chi-square distribution table for different levels of α and df. The table value for our exercise is:

χ²(0.01, 12) = 26.217
Now let's represent these data in a graph:

Figure 11.13 Hypothesis test for the evaluation of teaching ability of lecturers by faculty: the computed χ² (21.70833) falls in the acceptance region, below the table value 26.217 at α = 0.01
As you can see in Figure 11.13, the computed value of χ² is less than the table value and falls within the acceptance region. So we accept our null hypothesis that the teaching ability of lecturers is independent of faculty, according to these n = 50 observations.
The p-value leads to the same decision: since p-value ≈ 0.04 > α = 0.01, we again accept the null hypothesis.
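Below is a hedged sketch reproducing the computation of this exercise. Following the text, the statistic is summed over the cells with non-zero expected frequency and referred to df = (4 − 1)(5 − 1) = 12; note that statistical packages typically exclude the empty Very Poor column altogether (scipy's chi2_contingency, for instance, rejects tables with zero expected frequencies), which would change the degrees of freedom:

    import numpy as np
    from scipy.stats import chi2

    observed = np.array([
        [1,  3, 6, 0, 0],    # B&E
        [0,  5, 5, 0, 0],    # A&S
        [0,  0, 6, 4, 0],    # ENG
        [3, 10, 6, 1, 0]])   # OTHER

    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    mask = expected > 0                  # skip the empty column's cells
    diff = observed - expected
    chi_sq = (diff[mask] ** 2 / expected[mask]).sum()

    print(chi_sq)                 # about 21.708
    print(chi2.ppf(0.99, 12))     # table value at alpha = 0.01: 26.217
    print(chi2.sf(chi_sq, 12))    # p-value, about 0.04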
Analysis of Variance (ANOVA) Test for Difference
Analysis of variance (ANOVA) is used to test for differences among more than two sample means. To do so, ANOVA compares two different estimates of the variance, σ², of the same population: the first is estimated from the variation among the samples, and the second from the variation within the samples. If the null hypothesis is true, the two estimates should be about equal.
The ANOVA test is based on the F-ratio, which compares the two estimates of the variance:

F = (estimate of the variance among the sample means) / (estimate of the variance within the samples)
This can be written as F = σ̂²(among) / σ̂²(within), where

σ̂²(among) = ∑ nj (x̄j − x̿)² / (k − 1)

σ̂²(within) = ∑ [ (nj − 1) / (nT − k) ] s²j

and
• nj = size of the jth sample
• x̄j = sample mean of the jth sample
• x̿ = mean (average) of the sample means (the grand mean)
• k = number of samples
• s²j = variance of the jth sample
• nT = total of the sample sizes (∑ nj)
Let's consider an example of the test for difference. Suppose we want to test whether there is a significant difference between the salaries of males and females in a questionnaire study for a corporation. The selected sample size is 475. Salary and gender are categorized in the questionnaire form as:

SAL:
1. 0 - $50,000
2. $50,000 - $100,000
3. $100,000 - ……..

GENDER:
1. Male
2. Female

We can formulate our hypotheses as:

Ho: µ(male) = µ(female) (salaries of employees do not differ by gender)
Ha: µ(male) ≠ µ(female) (salaries of employees differ by gender)

α = 0.01
Below is the SPSS output of the ANOVA test for the employee data.

ANOVA (dependent variable: SAL)

                  Sum of Squares    df    Mean Square      F       Sig.
Between Groups         6.753          1       6.753      49.510    .000
Within Groups         64.380        472        .136
Total                 71.133        473
To test our null hypothesis we have to compare the computed F value with the F-table value, which is found using the degrees of freedom (df). In the ANOVA test there are two degrees of freedom:

df in the numerator of the F-ratio = k − 1 = 2 − 1 = 1
df in the denominator of the F-ratio = ∑(nj − 1) = nT − k = 472

where
• k = number of samples
• nj = size of the jth sample
• nT = total sample size (the output is based on 474 valid cases, hence nT − k = 474 − 2 = 472)
Then the F-table value is:

F(1, 472) at α = 0.01 ≈ 6.63

Now let's test our hypothesis at α = 0.01:
Figure 11.14 Hypothesis test for the difference in salaries between males and females: the computed F (49.510) falls in the rejection region, beyond the table value 6.63 at α = 0.01

F-computed = 49.510 > F-table = 6.63. Since the computed F falls within the rejection region, we reject our null hypothesis and accept the alternative hypothesis: in this corporation, the salaries of employees differ between males and females.
Alternatively, p-value = 0.000 < α = 0.01, and again we reject the null hypothesis.
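A minimal sketch checking this output with scipy: the critical value and the p-value of the computed F ratio under an F distribution with (1, 472) degrees of freedom:

    from scipy.stats import f

    print(f.ppf(0.99, 1, 472))     # F-table value at alpha = 0.01, about 6.7
    print(f.sf(49.510, 1, 472))    # p-value, far below 0.01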
The shape of the F-distribution at different levels of degrees of freedom is shown in Figure 11.15. The first number in each parenthesis is the df in the numerator of the F-ratio formula, and the second number the df in the denominator.
31
(25,25) df
Figure 11.15
(5,5) df
Representing F
distribution
with different
levels of
degrees of
freedom
(2,1) df
F distribution
The figure above shows that as the df in both the numerator and the denominator of the F-ratio increase, the shape of the F distribution approaches that of the normal distribution.
11.7 Correlation, Simple and Multiple Regression
Correlation and regression analyses are used to determine the nature and the strength of the relationship between two variables, say X and Y. In regression analysis, one of the variables is the independent variable and the other is the dependent variable; in correlation analysis this distinction is not made. The number of independent variables can be increased, whereas there can be only one dependent variable. Regression presumes a causal relationship between the dependent and independent variable(s): changes in the independent variables are taken to cause changes in the dependent variable.
We usually expect either a direct (positive) or an inverse (negative) relationship between two variables in correlation and regression analyses. Figure 11.16 illustrates these relationships:

Figure 11.16 (a) Direct relationship between X and Y; (b) inverse relationship between X and Y
The curve in Figure 11.16 (a) has a positive slope, corresponding to a direct relationship between X and Y, whereas the curve in Figure 11.16 (b) has a negative slope, corresponding to an inverse relationship between the two variables. Usually, the independent variable (X) is placed on the x-axis and the dependent variable (Y) on the y-axis.
The relationship between two variables, including all the data points of the observed data, is well represented in a scatter diagram. When the relationship between two variables is described by a straight line, we say that there is a linear relationship between them, although there may be some deviations from the straight line. When the relationship takes the form of a curve, it is called curvilinear. These types of relationships are represented in Figure 11.17:

Figure 11.17 Types of relationships between x and y in scatter diagrams: (a) direct linear, (b) inverse linear, (c) direct curvilinear, (d) inverse curvilinear, (e) inverse linear, more scattered, (f) no relationship

The last panel, (f), is extremely scattered, which indicates no relationship between the x and y variables. The more scattered the data points are around the straight line, the weaker the relationship between the two variables.
Correlation Analysis
Correlation analysis is used to determine the strength of the relationship between any two variables. In statistical theory, two measures describe this: the correlation coefficient (R) and the coefficient of determination (R²), both of which express the degree of linear relationship between two variables.
The correlation coefficient shows the degree of linear relationship between any two variables. The coefficient of determination, on the other hand, is the square of the correlation coefficient and is used especially in regression analysis, where it indicates how far the changes in the dependent variable can be explained by the changes in the independent variable(s).
The correlation coefficient (R) takes values between −1 and +1; its absolute value can never exceed 1. As R approaches ±1, the degree of relationship between the two variables increases. If R is positive there is a direct relationship; if R is negative there is an inverse relationship. If R is exactly 0 there is no relationship, and if |R| is 1 there is perfect correlation between the two variables. The correlation coefficient of the variables x and y is given by:

R(xy) = ∑(x − x̄)(y − ȳ) / √( ∑(x − x̄)² ∑(y − ȳ)² )
The coefficient of determination (R²) is the square of the correlation coefficient (R):

R² = (R)²

The coefficient of determination takes values between 0 and 1. If R² is 0, there is no relationship between the dependent variable and the independent variable(s) in the regression analysis; if it is 1, the dependent variable is perfectly determined by the independent variable(s).
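A minimal sketch of the correlation coefficient formula above; the short x and y series are made-up illustration data, not from the text:

    def correlation(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        syy = sum((yi - my) ** 2 for yi in y)
        return sxy / (sxx * syy) ** 0.5

    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]
    R = correlation(x, y)
    print(R, R ** 2)    # R about 0.775, R^2 about 0.6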
Simple Regression Analysis
Remember that scatter diagrams showed us how data points scatter around a straight line. In this section, we will calculate this regression line.
In simple regression analysis, there is one dependent and one independent variable, and the regression line is:

Y = a + bX

where
• Y : dependent variable
• a : y-intercept of the regression line
• b : slope of the regression line
• X : independent variable
34
The dependent variable (Y) is determined by the independent variable (X).
This formula is used to see how X determines Y variable. But we can also use this
formula to make estimation for Y values. Then we would use the following formula
instead:
Yˆ = a + bX
where Ŷ is the estimated value of Y.
35
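A hedged sketch of fitting the line Ŷ = a + bX by least squares (the standard fitting method, though the text has not yet derived it); the x and y data are the same made-up series used above:

    def fit_line(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))     # slope
        a = my - b * mx                             # intercept: line through the means
        return a, b

    a, b = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
    print(a, b)                                     # a = 2.2, b = 0.6
    print([a + b * xi for xi in [1, 2, 3, 4, 5]])   # estimated Y values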