Uploaded by Mohamed El-Zarka

Statistical analysis of (ungrouped and grouped data)

Higher Institute of Engineering & Technology, El-Boheira
‫المعهد العالي للهندسة والتكنولوجيا بالبحيرة‬
Computer Engineering Department
Research paper submitted in fulfillment of
Mathematics 4
BA221
Statistical analysis of (ungrouped and grouped data)
Expectation on the future data by trend line
Name: Mohamed Yosry Mohamed El-Zarka
Code: 19100
Year: first level
Department: Computer Engineering
Supervisor: Dr. Abdel Fattah Abu Hashem
Introduction
Data analysis is a method in which data is collected and organized so that one can derive
helpful information from it. In other words, the main purpose of data analysis is to look at
what the data is trying to tell us. For example, what does the data show? What does the data
not show?
Data analysis is the act of trying to learn something from a dataset. It is not an end in
itself; it is used in service of optimizing or improving other activities.
Data processing: Data initially obtained must be processed or organized for analysis.
For instance, this may involve placing data into rows and columns in a table format
(i.e., structured data) for further analysis, such as within a spreadsheet or statistical
software. [2]
Grouped data vs. ungrouped data
- In statistics, the term data refers to information that has been collected and recorded
for a specific project; it can be either qualitative or quantitative.
- Grouped and ungrouped data are both types of data. Grouped data has been classified into
categories based on similar characteristics, whereas ungrouped data is raw data.
- Both types of data can be represented by frequency tables. For ungrouped data there are
no class limits, so tally marks are used. Grouped data in a frequency table has a lower
class limit and an upper class limit for each class.
- Both types of data can be used to calculate the mean, mode and median of a sample or a
population. [1]
Differences      Grouped                                   Ungrouped
---------------  ----------------------------------------  --------------------------------
Classification   Organized into classes                    No form of organization
Preference       Preferred when analyzing data             Preferred when collecting data
Accuracy         Has higher accuracy levels when           Less accurate in determining
                 calculating mean and median               mean and median
Presentation     Frequency tables are mostly used          Lists are used in this data type
Summary          Summarized in frequency distribution      No form of summarization
Ungrouped data
Ungrouped data is a collection of statistical data that is classified only in general terms
and is otherwise uncategorized, unfiltered and unsorted. In other words, the data is described
generally but has not been subdivided into groups or categories, and it consists of all the
data collected with none of it omitted. It is also presented in the original order in which
it was collected. [9]
Some of the advantages of ungrouped data are as follows:
1. Most people can easily interpret it.
2. When the sample size is small, it is easy to calculate the mean, mode and
median.
3. It does not require technical expertise to analyze it. [7]
Arithmetic mean value
The mean is the average of the numbers. It is easy to calculate: add up all the numbers,
then divide by how many numbers there are. In other words, it is the sum divided by
the count. [3]
$$\text{Arithmetic mean value} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Geometric mean value
The geometric mean, sometimes referred to as the geometric average of a set of numerical
values, is, like the arithmetic mean, a type of average and a measure of central tendency.
It is a special type of average where we multiply the numbers together and then take a
square root (for two numbers), a cube root (for three numbers), or in general the nth root
(for n numbers).
Because of the formula used to calculate it, all values in the data set must have the same
sign, and in practice it is applied to positive values. In addition, if the data set contains
a zero, the geometric mean will always be zero. [4]
$$\text{Geometric mean value} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$
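The nth-root formula can be sketched in Python as follows. This is only an illustration: the function name geometric_mean is our own, and taking logarithms instead of multiplying directly is a design choice made here to avoid very large intermediate products. It assumes non-negative input.

```python
import math

def geometric_mean(values):
    """Return the n-th root of the product of n non-negative numbers."""
    if not values:
        raise ValueError("geometric mean of an empty data set is undefined")
    if any(v < 0 for v in values):
        raise ValueError("this sketch only accepts non-negative values")
    if any(v == 0 for v in values):
        return 0.0  # a single zero makes the whole product zero
    # exp(mean(log x)) equals (x1 * x2 * ... * xn) ** (1 / n)
    log_sum = sum(math.log(v) for v in values)
    return math.exp(log_sum / len(values))

print(geometric_mean([2, 8]))     # square root of 2 * 8
print(geometric_mean([1, 3, 9]))  # cube root of 1 * 3 * 9
```

For two numbers this reduces to the square root of their product, and for three numbers to the cube root, as described above.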
Deviation
Deviation is a measure of difference between the observed value of a variable and some
other value, often that variable's mean. The sign of the deviation reports the direction
of that difference (the deviation is positive when the observed value exceeds the
reference value). The magnitude of the value indicates the size of the difference. [9]
$$\text{Deviation} = x_i - \bar{x}$$
Variance
The Variance is defined as the average of the squared differences from the Mean.
Variance is non-negative because the squares are positive or zero. The variance of a
constant is zero.
The variance of the distribution is the square of the standard deviation. On its own it is
harder to interpret than the standard deviation because it is expressed in squared units, but
it is a necessary step in calculating the standard deviation.
It is also useful when creating statistical models, since a very low variance can be a sign
that you are over-fitting your data. [3]
$$\text{Variance} = \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard deviation
Standard deviation is defined as the square root of the average of the squared deviations
of the values from their mean, or simply the square root of the variance.
It is a measure of how spread out the numbers are. A smaller standard deviation means that
the data is more concentrated around the mean. A larger standard deviation means that the
data tends to be more spread out from the mean.
Why do we use the standard deviation when we have the variance? Because it keeps the result
in the same units as the data: if the mean is in m/s, then the variance is in m²/s², whereas
the standard deviation is in m/s.
For this reason the standard deviation is more intuitive than the variance and closer to the
values of the original data set. It is used more often for demographic analysis or in
sample surveys to get an idea of what is normal in the population. [4]
$$\text{Standard deviation} = \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
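As a minimal sketch of the two formulas above, the following Python functions divide by n (the population form used in this paper, not the n - 1 sample form). The function names and the speed values are hypothetical and serve only as an illustration.

```python
import math

def population_variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_std_dev(values):
    """Square root of the population variance, in the units of the data."""
    return math.sqrt(population_variance(values))

speeds = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical speeds in m/s
print(population_variance(speeds))  # in (m/s) squared
print(population_std_dev(speeds))   # in m/s
```

The variance comes out in squared units, while the standard deviation is back in the units of the data, which is the reason given above for preferring it.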
Mode
The mode is simply the most common value.
Although there will only be one mean and median in a set of data, it is possible to have
more than one mode. A set of data with two modes is considered “bimodal,” one with
three, “trimodal” etc.
A big advantage of statistical mode is that it is not restricted to numbers alone. For
example, among all the letters of the English alphabet, the mode is the letter ‘E’, which
is the most frequently encountered letter. However, we cannot define the median or
mean letter, since these can only be defined for numbers. This makes the scope of the
mode quite broad in nature. [6]
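The two points above, that a data set may have several modes and that the mode also applies to non-numeric values, can be illustrated with a short Python sketch. The helper name modes and the sample inputs are our own.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Keep all values tied for the highest count, in first-seen order
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 2, 3, 3, 4]))  # a bimodal list of numbers
print(modes("statistics"))        # letters work as well as numbers
```

Counting occurrences needs no arithmetic on the values themselves, which is why the mode is defined for letters while the mean and median are not.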
Median
The median is found by taking all the scores, arranging them in order from low to high, and
then selecting the middle number.
The median is one measure of central tendency. If one orders the elements from lowest to
highest, the median is simply the point where 50% of the data is above and 50% is below. It
is a good, intuitive metric of centrality that represents a "typical" or "middle" value. If
there is an even number of elements, it is the mean of the two middle numbers. [6]
Odd n:
$$\text{Median} = X_{\frac{n+1}{2}}$$
Even n:
$$\text{Median} = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2}$$
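The odd and even cases can be sketched in Python as below. The function name is our own; note that Python lists are indexed from 0, so the item called X with index (n+1)/2 in the formula is found at position n // 2 in the sorted list.

```python
def median(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd n: the single middle item
    # even n: average of the items at 1-based positions n/2 and n/2 + 1
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5]))     # odd number of items
print(median([4, 1, 3, 2]))  # even number of items
```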
Range
Range is defined simply as the difference between the maximum and minimum
observations. It is intuitively obvious why we define range in statistics this way - range
should suggest how diversely spread out the values are, and by computing the
difference between the maximum and minimum values, we can get an estimate of the
spread of the data.
The range can sometimes be misleading when there are extremely high or low values.
This limitation of range is to be expected primarily because range is computed taking
only two data points into consideration. Thus, it cannot give a very good estimate of
how the overall data behaves. [5]
$$\text{Range} = x_{max} - x_{min}$$
Example 1 (ungrouped data): For the following data: {100, 120, 120, 60, 100,
90, 140, 120, 80, 150}, evaluate the arithmetic mean, the geometric mean, the standard
deviation, the median, the mode and the range.
xi       xi − x̄    (xi − x̄)²
-------  --------  ----------
100      -8        64
120      12        144
120      12        144
60       -48       2304
100      -8        64
90       -18       324
140      32        1024
120      12        144
80       -28       784
150      42        1764
-------  --------  ----------
∑ xi = 1080        ∑ (xi − x̄)² = 6760
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1080}{10} = 108$$
$$\text{Geometric mean value} = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$
$$= \sqrt[10]{100 \cdot 120 \cdot 120 \cdot 60 \cdot 100 \cdot 90 \cdot 140 \cdot 120 \cdot 80 \cdot 150} = 104.6$$
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{6760}{10} = 676$$
$$\sigma = \sqrt{676} = 26$$
To get the median, you must sort the elements ascending or descending
{ 60,80,90,100,100,120,120,120,140,150 }
$$\text{Median} = \frac{X_{\frac{n}{2}} + X_{\frac{n}{2}+1}}{2} = \frac{X_5 + X_6}{2} = \frac{100 + 120}{2} = 110$$
$$\text{Mode} = 120$$
$$\text{Range} = x_{max} - x_{min} = 150 - 60 = 90$$
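The results of Example 1 can be cross-checked with Python's standard statistics module. This is only a verification sketch, not part of the original solution. It assumes Python 3.8 or later, where fmean, geometric_mean and multimode are available, and it uses pvariance and pstdev because the paper divides by n.

```python
import statistics

data = [100, 120, 120, 60, 100, 90, 140, 120, 80, 150]

mean = statistics.fmean(data)
geo_mean = statistics.geometric_mean(data)
variance = statistics.pvariance(data)   # population variance (divides by n)
std_dev = statistics.pstdev(data)       # population standard deviation
median = statistics.median(data)
modes = statistics.multimode(data)      # list of all most common values
data_range = max(data) - min(data)

print("mean:", mean)
print("geometric mean:", round(geo_mean, 1))
print("variance:", variance, "standard deviation:", std_dev)
print("median:", median, "mode(s):", modes, "range:", data_range)
```

The library values agree with the hand calculation above once the geometric mean is rounded to one decimal place.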
Grouped data
Grouped data is data that is classified into groups after collection. The raw data is
categorized into various groups and a table is created. The primary purpose of the table is
to show the number of data points occurring in each group. For instance, when a test is
done, the results are the data, and there are many ways to group this data. For example,
the number of students who scored within each band of 20 marks can be recorded.
Alternatively, the grades can be used, from A (90-100) all the way down to F (0-59), with
each category showing how many students fall into it. Histograms and frequency tables are
best used to show and interpret grouped data.
Grouping of data has the following advantages:
- Helps in improving the efficiency of estimations.
- Allows for greater balancing of statistical power of tests of the differences between
strata by analyzing equal number from strata.
- Irrelevant subpopulations are ignored while the significant ones are focused on. [1]
Class limits
Class limits are the (integer) lower and upper limits, that is, the lowest and highest values
that can belong to each class.
Grouped data is data that has been organized into groups known as classes. Grouped
data has been 'classified' and thus some level of data analysis has taken place, which
means that the data is no longer raw.
A data class is a group of data that is related by some user-defined property. For
example, if you were collecting the ages of the people you met as you walked down the
street, you could group them into classes as those in their teens, twenties, thirties,
forties and so on. Each of those groups is called a class.
Each of those classes has a certain width, referred to as the Class Interval or Class Size.
The class interval is very important when it comes to drawing histograms and frequency
diagrams. All the classes may have the same class size, or they may have different class
sizes, depending on how you group your data. The class interval is usually chosen to be a
whole number.
Note: The lower value of a class interval is called lower limit and upper value of that
class interval is called the upper limit. Thus, each class interval has lower and upper
limits. [7]
Frequency
Frequency is how often something occurs. By counting frequencies, we can make a
Frequency Distribution table. [3]
The midpoints
The midpoints of the intervals are computed by adding the two apparent limits together and
dividing by two. The midpoint for the interval 33 to 35 would thus be (33 + 35)/2 or 34. The
midpoint for the second interval (36-38) would be 37. [7]
Weight
The weight of each interval is calculated by multiplying the midpoint of this interval
by its frequency
Weight = midpoint * frequency
Arithmetic mean value
Find the midpoint xi of each interval and multiply it by the frequency fi of that interval,
then add up the products fi * xi. Divide this total by the sum of the frequencies to get
the mean value for the grouped data.
$$\text{mean value} = \bar{x} = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} x_i \cdot f_i$$
Deviation
Deviation in grouped data is a measure of difference between the midpoint of an
interval and the mean. The sign of the deviation reports the direction of that difference
(the deviation is positive when the observed value exceeds the reference value). The
magnitude of the value indicates the size of the difference.
Standard deviation
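The calculation is similar to that for ungrouped data, except that the squared deviation of
each class must be multiplied by its frequency in order to account for the weight of every
class. [3]
$$\text{Variance} = \sigma^2 = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2$$
$$\text{Standard deviation} = \sigma = \sqrt{\sigma^2}$$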
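A minimal Python sketch of the grouped mean and standard deviation is given below. The function name is our own. The classes 33-35 and 36-38 come from the midpoint example above, while the third class 39-41 and all three frequencies are hypothetical values chosen for illustration.

```python
import math

def grouped_mean_and_std(midpoints, frequencies):
    """Frequency-weighted mean and population standard deviation."""
    total = sum(frequencies)
    # Each midpoint stands in for every observation in its class
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / total
    variance = sum(f * (x - mean) ** 2
                   for x, f in zip(midpoints, frequencies)) / total
    return mean, math.sqrt(variance)

midpoints = [34, 37, 40]   # classes 33-35, 36-38 and (assumed) 39-41
frequencies = [2, 5, 3]    # hypothetical class frequencies

mean, std_dev = grouped_mean_and_std(midpoints, frequencies)
print(round(mean, 2), round(std_dev, 2))
```

Because every observation is replaced by the midpoint of its class, the result is an estimate of the mean and standard deviation of the underlying raw data.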
Mode
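First, find the interval with the highest frequency (fmax); this is the modal interval.
f1: the frequency of the interval before the modal interval.
f2: the frequency of the interval after the modal interval.
f1 and f2 are treated as torques at the two edges of the modal interval; the balance point
is the mode. S1 and S2 are the lower and upper limits of the modal interval, h is its width,
x is the distance of the mode from the lower limit, and xmin = S1.
Note that a data set can have more than one mode or no mode at all.
$$h = S_2 - S_1$$
$$f_1 x = f_2 (h - x)$$
$$\text{Mode} = x_{min} + x$$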
Cumulative Frequency
Calculating cumulative frequency gives you the sum (or running total) of all the
frequencies up to a certain point in a data set. In other words, the total of a frequency
and all frequencies so far in a frequency distribution. [3]
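The running total can be computed in Python with itertools.accumulate, as in this short sketch. The frequencies are hypothetical.

```python
from itertools import accumulate

frequencies = [3, 6, 10, 4]  # hypothetical class frequencies
cumulative = list(accumulate(frequencies))

print(cumulative)  # [3, 9, 19, 23]
```

The last cumulative frequency always equals the total number of observations.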
Quartile
In statistics, quartiles, a type of quantile, are the three points that divide a sorted data
set into four equal groups (by count of numbers), each representing a fourth of the sampled
population.
There are three quartiles: the first quartile (Q1), the second quartile (Q2), and the third
quartile (Q3).
The first quartile (lower quartile, QL) is equal to the 25th percentile of the data; it
splits off the lowest 25% of the data from the highest 75%. The second (middle) quartile, or
median, is equal to the 50th percentile of the data and cuts the data in half. The third
quartile (upper quartile, QU) is equal to the 75th percentile of the data; it splits off the
lowest 75% of the data from the highest 25%. [7]
How do we calculate quartiles?
We sort the set of data with n items (numbers) and pick the n/4-th item as Q1, the n/2-th
item as Q2 and the 3n/4-th item as Q3. If the indexes n/4, n/2 or 3n/4 are not integers,
then we interpolate between the nearest items.
For example, for n = 100 items, the first quartile Q1 is the 25th item of the ordered data,
Q2 is the 50th item and Q3 is the 75th item. A zero quartile Q0 would be the minimum item
and a fourth quartile Q4 would be the maximum item of the data, but these extremes are
simply called the minimum and maximum of the set. [7]
Median
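First, use the cumulative frequencies to find the set in which the position ∑f/2 is located.
L is the lower limit of that set, f is its frequency, cf is the cumulative frequency of all
the sets before it, N = ∑f, and h is the width of the set.
$$h = S_2 - S_1$$
$$\text{Median} = L + \left[\frac{\frac{N}{2} - cf}{f}\right] \cdot h$$
First quartile
First, find the set in which the position ∑f/4 is located. L, f, cf and h are defined for
that set in the same way as for the median.
$$Q_1 = L + \left[\frac{\frac{N}{4} - cf}{f}\right] \cdot h$$
Third quartile
First, find the set in which the position 3∑f/4 is located. L, f, cf and h are defined for
that set in the same way as for the median.
$$Q_3 = L + \left[\frac{\frac{3N}{4} - cf}{f}\right] \cdot h$$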
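The three formulas differ only in the fraction of N they target, so they can be sketched as one Python function. The function name and the sample classes are hypothetical. In this sketch L is the lower limit of the class that contains the target position, cf is the cumulative frequency of the classes before it, and equal class widths are assumed.

```python
def grouped_quantile(lower_limits, width, frequencies, fraction):
    """Interpolated quantile L + ((fraction * N - cf) / f) * h."""
    total = sum(frequencies)
    target = fraction * total  # N/2 for the median, N/4 for Q1, 3N/4 for Q3
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cf + f >= target:
            return lower + (target - cf) / f * width
        cf += f
    raise ValueError("fraction must be between 0 and 1")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with N = 20
limits, freqs = [0, 10, 20, 30], [2, 6, 8, 4]
print(grouped_quantile(limits, 10, freqs, 0.25))  # first quartile
print(grouped_quantile(limits, 10, freqs, 0.50))  # median
print(grouped_quantile(limits, 10, freqs, 0.75))  # third quartile
```

The interpolation assumes that the observations are spread evenly inside the class that contains the target position.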
Example 2 (grouped data): For the following grouped data, evaluate the
arithmetic mean, standard deviation, mode, median, and first and third quartiles.
Set        0-   12-  24-  36-  48-  60-  72-  84-  96-  108-
Frequency  8    12   15   18   24   16   12   k    8    6
(The solution table uses k = 6.)
Set    fi    xi    fi·xi   xi − x̄   (xi − x̄)²   fi(xi − x̄)²   cf
-----  ----  ----  ------  -------  ----------  ------------  ----
0-     8     6     48      -48.4    2342.56     18740.5       8
12-    12    18    216     -36.4    1324.96     15899.5       20
24-    15    30    450     -24.4    595.36      8930.4        35
36-    18    42    756     -12.4    153.76      2767.68       53
48-    24    54    1296    -0.4     0.16        3.84          77
60-    16    66    1056    11.6     134.56      2152.96       93
72-    12    78    936     23.6     556.96      6683.52       105
84-    6     90    540     35.6     1267.36     7604.16       111
96-    8     102   816     47.6     2265.76     18126.1       119
108-   6     114   684     59.6     3552.16     21313         125
-----  ----  ----  ------  -------  ----------  ------------  ----
∑      125         6798                         102221.66
[Figure: frequency polygon & curve of the grouped data]
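$$\text{mean value} = \bar{x} = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} x_i \cdot f_i = \frac{6798}{125} = 54.4$$
$$\sigma^2 = \frac{1}{\sum_{i=1}^{n} f_i} \sum_{i=1}^{n} f_i (x_i - \bar{x})^2 = \frac{102221.66}{125} = 817.8$$
$$\text{Standard deviation} = \sigma = \sqrt{817.8} = 28.6$$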
The mode is in the set [48 – 60], with f1 = 18 and f2 = 16.
$$h = S_2 - S_1 = 60 - 48 = 12$$
$$f_1 x = f_2 (h - x)$$
$$18x = 16(12 - x)$$
$$x = 5.65$$
$$\text{Mode} = x_{min} + x = 48 + 5.65 = 53.65$$
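$$\frac{\sum f}{2} = \frac{125}{2} = 62.5$$
The cumulative frequency rises from 53 to 77 in the set [48 – 60], so the median is in that
set, with L = 48, cf = 53 and f = 24.
$$\text{Median} = L + \left[\frac{\frac{N}{2} - cf}{f}\right] \cdot h = 48 + \left[\frac{62.5 - 53}{24}\right] \cdot 12 = 52.75$$
$$\frac{\sum f}{4} = \frac{125}{4} = 31.25$$
The cumulative frequency rises from 20 to 35 in the set [24 – 36], so the first quartile is
in that set, with L = 24, cf = 20 and f = 15.
$$Q_1 = L + \left[\frac{\frac{N}{4} - cf}{f}\right] \cdot h = 24 + \left[\frac{31.25 - 20}{15}\right] \cdot 12 = 33$$
$$\frac{3 \sum f}{4} = \frac{3 \cdot 125}{4} = 93.75$$
The cumulative frequency rises from 93 to 105 in the set [72 – 84], so the third quartile is
in that set, with L = 72, cf = 93 and f = 12.
$$Q_3 = L + \left[\frac{\frac{3N}{4} - cf}{f}\right] \cdot h = 72 + \left[\frac{93.75 - 93}{12}\right] \cdot 12 = 72.75$$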
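The figures of Example 2 can be recomputed with the short Python script below. It is only a verification sketch with variable names of our own. It takes k = 6 as in the solution table, uses the unrounded mean when computing deviations, and for the median and quartiles takes L, f and cf from the class whose cumulative frequency first reaches the target position.

```python
import math
from itertools import accumulate

lower_limits = [0, 12, 24, 36, 48, 60, 72, 84, 96, 108]
frequencies = [8, 12, 15, 18, 24, 16, 12, 6, 8, 6]  # k = 6 as in the table
width = 12

midpoints = [lower + width / 2 for lower in lower_limits]
total = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / total
variance = sum(f * (x - mean) ** 2
               for f, x in zip(frequencies, midpoints)) / total
std_dev = math.sqrt(variance)

# Mode: balance point f1 * x = f2 * (h - x) inside the most frequent class
modal = frequencies.index(max(frequencies))
f1, f2 = frequencies[modal - 1], frequencies[modal + 1]
mode = lower_limits[modal] + f2 * width / (f1 + f2)

cumulative = list(accumulate(frequencies))

def quantile(fraction):
    """L + ((fraction * N - cf) / f) * h for the class holding the target."""
    target = fraction * total
    for i, cf_up_to in enumerate(cumulative):
        if cf_up_to >= target:
            cf_before = cf_up_to - frequencies[i]
            return lower_limits[i] + (target - cf_before) / frequencies[i] * width
    raise ValueError("fraction must be between 0 and 1")

median, q1, q3 = quantile(0.5), quantile(0.25), quantile(0.75)
print(round(mean, 1), round(std_dev, 1), round(mode, 2))
print(median, q1, q3)
```

Small differences in the last digit can appear because the hand calculation rounds the mean to 54.4 before computing the deviations.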
Expectation on the future data by trend line
A trend line is a mathematical equation that describes the relationship between two
variables. It is produced from raw data obtained by measurement or testing. The
simplest and most common trend line equations are linear, or straight, lines. Once you
know the trend line equation for the relationship between two variables, you can easily
predict what the value of one variable will be for any given value of the other variable.
You should already have a trend line based on a data set you have taken or gathered
with the line representing a general trend of that data. Then, you can move onto
predictions. [8]
Predicting a Value
Examine your trend line equation to ensure it is in the proper form. The equation for a
linear relationship should look like this: y = mx + b, where m is the slope of the line and
b is its y-intercept. "x" is the independent variable and is usually the one you have
control over. "y" is the dependent variable that changes in response to x.
Uses for a Trend line: Trend Lines and Predictions
A trend line is most often used to display data that increases or decreases at a specific
and steady rate (at least within a specific timeline). That means that a trend line is a
great tool for predicting what value something will have in the future; trend lines and
predictions go hand in hand.
Some examples could be for predicting population size, predicting the amount of a
certain molecule in a solution over time, or creating an equation that can then be used
in the future to predict similar information with other data sets. [8]
Example 3 (trend line): The income of a family (in pounds) in 8 successive months is
shown in the following table. Estimate the forecast income in September and October.
Month (x)  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug
Income     400  450  420  500  550  600  580  700
x     y      xy      x²
----  -----  ------  ----
1     400    400     1
2     450    900     4
3     420    1260    9
4     500    2000    16
5     550    2750    25
6     600    3600    36
7     580    4060    49
8     700    5600    64
----  -----  ------  ----
36    4200   20570   204
[Figure: chart of income (y) against month number (x) with the fitted trend line]
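The trend line is written here as y = mx + c, where c is the intercept (b in y = mx + b).
The values of m and c are found from the two equations:
$$\sum y = m \sum x + N c$$
$$\sum xy = m \sum x^2 + c \sum x$$
$$4200 = m(36) + 8c \quad \Rightarrow \quad 9m + 2c = 1050$$
$$20570 = m(204) + c(36) \quad \Rightarrow \quad 102m + 18c = 10285$$
$$m = 39.76, \quad c = 346.07$$
$$y = 39.76x + 346.07$$
$$\text{Prediction in September} = 39.76(9) + 346.07 = 703.91$$
$$\text{Prediction in October} = 39.76(10) + 346.07 = 743.67$$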
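The same fit can be reproduced in Python by solving the two equations for m and c in closed form. This is a sketch with variable names of our own; the months January to August are numbered 1 to 8 as in the table, so September and October are x = 9 and x = 10.

```python
months = [1, 2, 3, 4, 5, 6, 7, 8]
income = [400, 450, 420, 500, 550, 600, 580, 700]

n = len(months)
sum_x = sum(months)
sum_y = sum(income)
sum_xy = sum(x * y for x, y in zip(months, income))
sum_x2 = sum(x * x for x in months)

# Eliminating c from the two equations gives the slope m directly
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n

def predict(x):
    """Income forecast for month number x on the fitted line y = m*x + c."""
    return m * x + c

print(round(m, 2), round(c, 2))
print(round(predict(9), 2), round(predict(10), 2))
```

The forecasts differ from the hand calculation by about 0.02 because the script keeps the unrounded values of m and c.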
References
1. http://www.differencebetween.net/language/words-language/differencebetween-grouped-data-and-ungrouped-data/
2. https://en.wikipedia.org/wiki/Data_analysis
3. https://www.mathsisfun.com/
4. https://www.khanacademy.org/
5. https://explorable.com/range-in-statistics
6. https://explorable.com/statistical-mode
7. https://www.wyzant.com/resources/lessons/math/statistics_and_probability/introduction/data
8. https://sciencing.com/use-line-equation-predicted-value-7985744.html
9. Jeffery T. Walker, Statistics in criminology and criminal justice: analysis and
interpretation, 1999