Uploaded by Shahzaib Salman

MEFall2023 2

Descrip Stat
Probability and Random Variables
The math, the computation, and examples.
Prof. Dr. Asad Ali
Department of Applied Mathematics and Statistics
Institute of Space Technology
Islamabad, Pakistan
Descriptive Statistics
Chapter 2: Descriptive Statistics:
Presentation of Data
Descriptive Statistics
Researchers can measure many physical processes, such as pressure, strength, survival time, and amount.
Often, hundreds or thousands of measurements are made, and procedures have been developed to organize,
summarize, and make sense of these measurements. These procedures, referred to as descriptive statistics,
are used to condense and summarize numerical observations, to extract the initial (meaningful)
information, and to make the data ready for further manipulation. In the univariate case, descriptive statistics
mainly covers the following tasks of data analysis.
Presentation of data using
Tabulation methods (frequency distributions)
Graphical methods (diagrams and graphs)
Measures of central tendency (averages and quantiles)
Measures of dispersion (ranges, deviations, variations)
In the multivariate case, descriptive statistics also covers, along with the above, the analysis of the relationships (covariance, correlation, regression, etc.) between different variables.
Presentation
Tabulation methods
Frequency distribution: The frequency (f ) of a particular observation is the number of times that
observation occurs in the data. A frequency distribution is a table that lists the observations along with
their respective frequencies.
Frequency distribution with no grouping: For discrete data with a small range (or a small number
of distinct values), the frequency table is constructed by arranging the distinct data values
in ascending order of magnitude, each with its corresponding frequency.
Frequency distribution with grouping: If the range of values is very broad or the data
are continuous, the entire data set is divided into non-overlapping groups or classes, and the
number of observations falling in each group or class is recorded.
A frequency distribution condenses bulky data to a small table, which tells us about the pattern and
shape of the distribution of values of the underlying variable or population.
A very simple example (without grouping)
Example 1.
The marks awarded for an assignment set for a BE (MS&E) class of 20 students were as follows:
6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8
Present this information in a frequency table.
Solution:
To construct a frequency table, we proceed as follows:
Draw a three-column table with the column headings “Marks”, “Tally”, and “Frequency”. Put all the
distinct values, without repetition, in the first column in ascending (or descending) order as
shown below.
Marks   Tally   Frequency
4
5
6
7
8
9
10
data: 6 7 5 7 7 8 7 6 9 7 4 10 6 8 8 9 5 6 4 8.
The first data value is 6, so put a tally bar against 6; the second is 7, so put a tally bar against 7. Continue
in this way for all the values. The number of bars against each data value is its frequency. To ease
counting, draw every fifth tally as a slash across the preceding four bars, so that tallies are bundled in groups of five.
Marks   Tally    Frequency
4       ||       2
5       ||       2
6       ||||     4
7       ||||/    5
8       ||||     4
9       ||       2
10      |        1
So we now have the data in a meaningful form. We can now answer questions such as the following:
Where is the data concentration (peak) point?
How is it declining?
Is this a normal distribution of marks, or is there something wrong with the class performance?
Do we need further investigations?
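The tallying in Example 1 can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper name `tally` and the `||||/` notation for a bundle of five are choices made here for illustration, not part of the original slides.

```python
from collections import Counter

# Marks of the 20 students from Example 1.
marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7, 4, 10, 6, 8, 8, 9, 5, 6, 4, 8]

# Count how many times each distinct mark occurs.
freq = Counter(marks)


def tally(count):
    """Return tally marks; every fifth mark is a slash closing a bundle."""
    bundles, rest = divmod(count, 5)
    parts = ["||||/"] * bundles + (["|" * rest] if rest else [])
    return " ".join(parts)


print(f"{'Marks':<7}{'Tally':<9}Frequency")
for value in sorted(freq):
    print(f"{value:<7}{tally(freq[value]):<9}{freq[value]}")
print("Total:", sum(freq.values()))

# The peak of the distribution is the most frequent mark.
peak_mark, peak_count = freq.most_common(1)[0]
print("Peak at mark", peak_mark, "with frequency", peak_count)
```

The total of the frequency column should equal the class size (20), which is a quick check that no value was missed.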
The “how to” of a frequency distribution with grouping.
When the data contain many distinct values spread over a wide range, it is impractical to set up a
frequency table with one row per data value, because the table would have too many rows. Before proceeding,
we need to learn a few terms and rules required for the construction of a
frequency distribution with grouping (classes).
Class-limits: The numbers that describe a class or group. The two limits are called lower class
limit and the upper class limit. The class-limits (CL) should be inclusive and should not cause any
overlapping between any adjacent classes, e.g. age in years can be classified as 10-14, 15-19, 20-24 or
10.0-14.9, 15.0-19.9, 20.0-24.9 etc.
Class-boundaries: The class-boundaries (CB) are precise numbers that separate one class from its
adjacent classes. A CB is the midpoint between the upper limit of one class and the lower limit of
the next class, e.g. for the first two classes 10-14 and 15-19, the boundary between them is
$\frac{14 + 15}{2} = 14.5$. Thus, for 10-14, 15-19, 20-24, the CBs are 9.5-14.5, 14.5-19.5, 19.5-24.5, so CBs are
one decimal place more precise than class-limits. The upper class-boundary of one class coincides
with the lower class-boundary of the next class, thus leaving no gap.
Class marks: Class marks are simply the midpoints of classes. For example, the class mark of class
10-14 is $\frac{10 + 14}{2} = 12$.
Class interval or class width: The class interval, traditionally denoted by “h”, is the difference between
the two class-boundaries of the same class, or the difference between the lower (or upper) limits of
two consecutive classes. In the above case the class interval is 5 (14.5 − 9.5 = 5, or 15 − 10 = 5). Ideally, all the classes should have
equal intervals. Unequal intervals can also occur, but they should be avoided unless required, because they make
interpretation difficult.
Class frequency: The frequency of a particular class is the number of data values that fall
within the limits of that class.
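The definitions above can be turned into a short calculation. The following is a sketch only: the function name `describe_classes` and the dictionary keys are hypothetical, and it assumes equally spaced inclusive class limits so that the gap between adjacent classes is the same everywhere.

```python
def describe_classes(limits):
    """Compute boundaries, class mark and width for inclusive class limits."""
    # Gap between the upper limit of one class and the lower limit of the
    # next: 1 for limits such as 10-14, 15-19 (assumed constant).
    gap = limits[1][0] - limits[0][1]
    rows = []
    for lower, upper in limits:
        lower_cb = lower - gap / 2          # lower class-boundary
        upper_cb = upper + gap / 2          # upper class-boundary
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower_cb, upper_cb),
            "mark": (lower + upper) / 2,    # class mark (midpoint)
            "width": upper_cb - lower_cb,   # class interval h
        })
    return rows


for row in describe_classes([(10, 14), (15, 19), (20, 24)]):
    print(row)
```

For the classes 10-14, 15-19, 20-24 this reproduces the boundaries 9.5-14.5, 14.5-19.5, 19.5-24.5, the class marks 12, 17, 22 and the width h = 5 quoted in the text.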
A typical frequency distribution with grouping looks like the following table.
Classes   Class-boundaries   Tally bars   Class-marks        Frequency
10-14     9.5-14.5           ···          (10 + 14)/2 = 12   ···
15-19     14.5-19.5          ···          17                 ···
20-24     19.5-24.5          ···          22                 ···
···       ···                ···          ···                ···
···       ···                ···          ···                ···
The columns of class-boundaries and class-marks help in the calculation of different statistical
quantities, such as the mean, median and quantiles, as we will see in the next chapter.
A few rules
How many classes? There is no hard rule for deciding how many classes to make. Both
too few and too many classes defeat the purpose of constructing the frequency distribution. Too
few classes result in the loss of a lot of information, and too many classes defeat the purpose of
condensation. As a rule of thumb, a number between 5 and 15 gives reasonable results.
(I think 15 is still too large; I would not take a number larger than 10 unless I am using a computer.)
Find the range, that is, the difference between the maximum and the minimum values in the data.
Calculate the class width/interval “h” by dividing the range of the data by the number of classes. If
the division results in a decimal number, take the next higher whole number. Avoid using fractional
numbers as intervals; they only make the work harder. Taking a multiple of 5 or 10 eases the problem
and also increases the readability of the table. The resulting classes should cover the whole of the
data.
Note: you can also choose a suitable interval first and then calculate the number of classes, provided
the whole data is covered in a reasonable number of classes.
Where to start the first class from? Usually the lower class-limit is put at or below the smallest
data value. Remember, the lower class-limit of the first class should never be larger than the smallest value
of the data, otherwise the values at the lower end of the data will be lost. Starting from a multiple of 5 or
10 would not hurt.
Find the upper class-limit by counting from the lower class-limit to the end of the interval. Note
that adding the interval directly to the lower class-limit is erroneous, because the classes are inclusive.
Adding the interval to the lower class-limit of a class gives the lower class-limit of the next class,
rather than the upper limit of the same class (most students forget this, so be careful). For example, with h = 5 and a lower limit of 10, counting 10, 11, 12, 13, 14 gives the class 10-14, whereas 10 + 5 = 15 is the lower limit of the next class.
Find the rest of the classes by just adding the interval to the lower and the upper class-limits to
get the lower and upper class-limits of the next class.
Now the hard part... scanning the data (mouse hunt)... and putting the values in appropriate
classes. Placing tally marks and frequencies. Determine the sum of frequencies to check whether all
the values were included.
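The rules above can be sketched as a small routine. This is an illustration under stated assumptions, not a definitive recipe: the function name `build_classes` and its parameters `round_to` and `start_multiple` are hypothetical, the data are assumed to be whole numbers (so the upper limit is the lower limit plus h minus 1), and the inputs 363, 431 and 8 are the smallest value, largest value and number of classes used in Example 2.

```python
import math


def build_classes(smallest, largest, n_classes, round_to=5, start_multiple=10):
    """Return the class width h and a list of inclusive (lower, upper) limits."""
    data_range = largest - smallest
    # Raw width, rounded up to the next multiple of `round_to`.
    h = math.ceil(data_range / n_classes / round_to) * round_to
    # Start at a convenient multiple at or below the smallest value.
    lower = (smallest // start_multiple) * start_multiple
    classes = []
    while lower <= largest:
        # Inclusive classes: the upper limit is lower + h - 1, not lower + h.
        classes.append((lower, lower + h - 1))
        lower += h
    return h, classes


h, classes = build_classes(363, 431, 8)
print("h =", h)
for lower, upper in classes:
    print(f"{lower}-{upper}")
```

With these inputs the raw width 68 / 8 = 8.5 is rounded up to h = 10, and the classes run from 360-369 to 430-439, covering the whole range.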
An example of frequency distribution with grouping
Example 2.
Thirty energy saver light bulbs were tested to determine how long they usually last. The results, to the
nearest day, were recorded as follows:
423 392 399 369 408 415 387 431 428 411
401 422 393 363 396 394 391 372 371 405
410 377 382 419 389 400 386 409 381 390
Construct a frequency distribution for these values.
Solution:
First we need to find the range:
Range = Largest − Smallest = 431 − 363 = 68
Let there be 8 classes; the class interval is therefore
$h = \frac{\text{Range}}{\text{Number of classes}} = \frac{68}{8} = 8.5 \approx 10$
We round up to h = 10 because it eases the data scanning process.
Now let's make the table and set the classes. The smallest value is 363, so we start from 360 and set the
first class as 360-369, the second as 370-379, and so on. Now start scanning the data, allocate the values to
their corresponding classes, and put tallies for them accordingly.
When a data value is allocated to a class, cross out that value in the data set
to indicate that it has been counted, so that it is not counted again.
Classes    Tally bars (TBs)   Frequency (f)
360-369
370-379
380-389
390-399
400-409
410-419
420-429
430-439
Total
Go on scanning, canceling and counting and put the tallies accordingly. Fill up the rest of the columns.
Classes    Tally bars (TBs)   Frequency (f)
360-369    ||                 2
370-379    |||                3
380-389    ||||/              5
390-399    ||||/ ||           7
400-409    ||||/              5
410-419    ||||               4
420-429    |||                3
430-439    |                  1
Total                         $\sum f = n = 30$
Sum up the frequencies to check whether all the data values have been picked up.
Looking at this frequency distribution, we can quickly see that the largest group of bulbs has a life
between 390 and 399 days, as this class has the largest frequency (7). Thus, this class can be
regarded as a representative group of this data. We can also see how the frequencies decrease toward
the tails of the distribution, and that the distribution looks fairly symmetric.
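The scanning and counting of Example 2 can be reproduced in code as a cross-check. The sketch below assumes the class width h = 10 and the starting point 360 chosen in the solution; the variable names are illustrative.

```python
# Bulb lifetimes (days) from Example 2.
bulb_life = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
             401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
             410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

h = 10        # class width chosen in the solution
start = 360   # lower limit of the first class

# Create the empty classes 360-369, 370-379, ..., 430-439.
freq = {(lower, lower + h - 1): 0
        for lower in range(start, max(bulb_life) + 1, h)}

# "Scan" the data: drop each value into the class that contains it.
for x in bulb_life:
    lower = start + ((x - start) // h) * h
    freq[(lower, lower + h - 1)] += 1

for (lower, upper), f in freq.items():
    print(f"{lower}-{upper}  {f}")
print("Total:", sum(freq.values()))

# Class with the largest frequency.
modal_class = max(freq, key=freq.get)
```

The total of 30 confirms that every observation was allocated to exactly one class.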
Relative frequency and percentage frequency
While studying these data we may want to know not only how long the bulbs last, but also what
proportion of the bulbs falls into each class of bulb’s life.
This is called the relative frequency (RF) of a particular observation or class and is found by dividing
its corresponding frequency (f ) by the total number of observations n: that is:
$RF = \frac{f}{n}$
A more readable measure is the percentage relative frequency (PRF), which is found by multiplying each relative frequency
value by 100. Thus:
PRF = RF × 100
The PRF tells us what percent of the observations fall in a particular class, which is usually easier
to interpret than the RF.
Example 3.
Let's calculate the RF and PRF for Example 2.
Classes    f                   RF = f/n       PRF = RF × 100
360-369    2                   2/30 = 0.07    (2/30) × 100 = 7
370-379    3                   3/30 = 0.10    (3/30) × 100 = 10
380-389    5                   0.17           17
390-399    7                   0.23           23
400-409    5                   0.17           17
410-419    4                   0.13           13
420-429    3                   0.10           10
430-439    1                   0.03           3
Total      $\sum f = n = 30$   1.0            100
Looking at this table we can now say, for example for the class 390-399, that:
The chance that a randomly selected bulb has a life in this class is approximately 0.23.
23% of the bulbs have a life from 390 days up to, but less than, 400 days.
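The RF and PRF columns follow directly from the two formulas above. A minimal sketch, with illustrative variable names and the class frequencies taken from Example 2:

```python
classes = ["360-369", "370-379", "380-389", "390-399",
           "400-409", "410-419", "420-429", "430-439"]
f = [2, 3, 5, 7, 5, 4, 3, 1]   # class frequencies from Example 2

n = sum(f)                     # total number of observations
rf = [fi / n for fi in f]      # relative frequency, RF = f / n
prf = [100 * r for r in rf]    # percentage relative frequency, PRF = RF * 100

print(f"{'Class':<10}{'f':>3}{'RF':>7}{'PRF':>6}")
for c, fi, r, p in zip(classes, f, rf, prf):
    print(f"{c:<10}{fi:>3}{r:>7.2f}{p:>6.0f}")
print(f"{'Total':<10}{n:>3}{sum(rf):>7.2f}{sum(prf):>6.0f}")
```

Because each RF is rounded to two decimals for display, the printed RF column may not add up to exactly 1.00 even though the unrounded values do.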
Cumulative frequency distribution
A cumulative frequency distribution table is the same as a frequency distribution table with
additional columns that give the cumulative frequency (CF) and the cumulative percentage (CP)
of the data.
The cumulative frequency distribution gives us an idea of how many observations fall
below or above a given value. It also tells us the number of observations that lie between
two given values.
The CFs are obtained by adding the frequency of each class, in succession, to the
cumulative total of the previous frequencies, that is, by keeping a running total of the
frequency column.
The accumulation can be conducted either from the top class (or value), in which case the CF is
called the “less than” type CF, or from the bottom class (or value), which is known as the “more
than” type CF.
In grouped data, for the “less than” type CF the upper class boundaries are used and for “more
than” type the lower class boundaries are used.
Example 4.
We calculate a “less than” type CF and CP for the data in Example 2.
Upper Class Boundaries   f        CF          CP = (CF / n) × 100
<369.5                   2        2           (2/30) × 100 = 7
<379.5                   3        2+3=5       (5/30) × 100 = 17
<389.5                   5        5+5=10      33
<399.5                   7        10+7=17     57
<409.5                   5        17+5=22     73
<419.5                   4        22+4=26     87
<429.5                   3        26+3=29     97
<439.5                   1        29+1=30     100
Total                    n = 30
Suppose we are asked how many, or what percent, of the observations lie below 399.5.
From the table we quickly learn that
- there are 17 observations below the given value, which is 57% of the entire data.
Note: We use the upper class boundaries for a “less than” (<) type CF distribution.
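The running total behind the “less than” table can be sketched with `itertools.accumulate` from the Python standard library. The variable names are illustrative, and percentages are rounded to whole numbers as in the table.

```python
from itertools import accumulate

upper_boundaries = [369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5, 439.5]
f = [2, 3, 5, 7, 5, 4, 3, 1]   # class frequencies from Example 2
n = sum(f)

# "Less than" CF: running total of f from the top class downwards.
cf_less = list(accumulate(f))
# Cumulative percentage, CP = CF / n * 100, rounded to whole percent.
cp_less = [round(100 * c / n) for c in cf_less]

for ub, fi, c, p in zip(upper_boundaries, f, cf_less, cp_less):
    print(f"<{ub:<7}{fi:>3}{c:>5}{p:>6}")

# How many observations lie below 399.5?
below = dict(zip(upper_boundaries, cf_less))
print("Below 399.5:", below[399.5], "observations")
```

The last CF must equal n; if it does not, a frequency has been mistyped.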
Example 5.
Now let's calculate a “more than” type CF and CP for the data in Example 2.
Lower Class Boundaries   f        CF          CP = (CF / n) × 100
>359.5                   2        28+2=30     (30/30) × 100 = 100
>369.5                   3        25+3=28     (28/30) × 100 = 93
>379.5                   5        20+5=25     83
>389.5                   7        13+7=20     67
>399.5                   5        8+5=13      43
>409.5                   4        4+4=8       27
>419.5                   3        1+3=4       13
>429.5                   1        1           3
Total                    n = 30
Suppose now we are asked how many, or what percent, of the observations lie above 399.5.
From the table we quickly learn that
- there are 13 observations above the given value, which is 43% of the entire data.
Note: We use the lower class boundaries for a “more than” (>) type CF distribution.
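The “more than” accumulation runs from the bottom class upwards, which can be sketched by accumulating the reversed frequency column. As before, the variable names are illustrative and percentages are rounded to whole numbers.

```python
from itertools import accumulate

lower_boundaries = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5]
f = [2, 3, 5, 7, 5, 4, 3, 1]   # class frequencies from Example 2
n = sum(f)

# "More than" CF: running total of f starting from the bottom class,
# then flipped back so that it lines up with the class order.
cf_more = list(accumulate(reversed(f)))[::-1]
cp_more = [round(100 * c / n) for c in cf_more]

for lb, fi, c, p in zip(lower_boundaries, f, cf_more, cp_more):
    print(f">{lb:<7}{fi:>3}{c:>5}{p:>6}")

# How many observations lie above 399.5?
above = dict(zip(lower_boundaries, cf_more))
print("Above 399.5:", above[399.5], "observations")
```

The 17 observations below 399.5 and the 13 above it add up to n = 30, as they should, because no observation can equal a class boundary.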
Graphical Methods
We now introduce the graphic displays widely used for data presentation in all sciences. Most of the time
we want a visual presentation of the data in order to see patterns clearly. Patterns in data are commonly
described in terms of center, spread, shape, and unusual features. Some common distributions have
special descriptive labels, such as symmetric, bell-shaped, skewed, etc.
We often need answers to questions like:
Where are the data (center) located?
How spread out are the data?
Are the data symmetric or skewed?
Are there outliers in the data?
Histogram
A histogram is a visual version of a frequency table. The main purpose of a histogram is to enhance the
presentation of data. You can present the same information in a table; however, the graphic presentation
format usually makes it easier to see the nature of the distribution. It consists of vertical bars, usually called
‘bins’ or ‘frequency bins’, that represent the different classes of a frequency table. Usually, there is no
space between adjacent bars. The height of each bar indicates the frequency of its class.
A histogram can typically help you answer the following questions:
What is the most frequent observation?
What distribution (center, variation and shape) does the data have?
Does the distribution of data look symmetric or is it skewed towards the left or right?
Example 6.
Let's construct a histogram and a relative frequency histogram for the energy saver bulb data given in
Example 2. We have already constructed the frequency table in Example 3. Let's now depict it.
[Figure: two panels with x-axis “Data values” (360 to 440). Left: “Histogram of Data”, y-axis “Frequency” (0 to 7). Right: “Relative Frequency Histogram of Data”, y-axis “Frequency” (0.00 to 0.20).]
One can also construct a percentage relative frequency histogram by multiplying the relative
frequencies by 100.
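As a rough stand-in for the plotted figures, the frequency table can be drawn as a text histogram. This sketch makes two simplifications that are assumptions of the example, not of the slides: the bars run horizontally rather than vertically, and the function name `histogram_lines` is hypothetical. Each bar is labelled with both f and RF to show that the two histograms have the same shape and differ only in the scale of the frequency axis.

```python
classes = ["360-369", "370-379", "380-389", "390-399",
           "400-409", "410-419", "420-429", "430-439"]
f = [2, 3, 5, 7, 5, 4, 3, 1]   # class frequencies from Example 2


def histogram_lines(labels, freqs, symbol="#"):
    """Return one text bar per class, annotated with f and RF = f / n."""
    n = sum(freqs)
    widest = max(freqs)
    lines = []
    for label, fi in zip(labels, freqs):
        bar = (symbol * fi).ljust(widest)   # bar length equals the frequency
        lines.append(f"{label}  {bar}  f={fi}  RF={fi / n:.2f}")
    return lines


for line in histogram_lines(classes, f):
    print(line)
```

Reading the bars from top to bottom shows the peak at 390-399 and the decline of the frequencies toward both tails.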
Some of the key features that we usually look for in a histogram.
Center: Graphically, the center of a distribution is located at the median of the distribution. The median
is the point in a graphic display where about half of the observations are on either side. In the chart
to the right, the height of each column indicates the frequency of observations. Here, the observations
are centered over 4.
Spread: The spread of a distribution refers to the variability of the data. If the observations cover a
wide range, the spread is larger. If the observations are more clustered around a single value, the
spread is smaller.
Shape: The shape of a distribution is described by the following characteristics.
Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak
are called unimodal, and distributions with two clear peaks are called bimodal.
Symmetry. When it is graphed, a unimodal symmetric distribution can be divided at the center
so that each half is a mirror image of the other. A single-peaked symmetric distribution is referred
to as a bell-shaped distribution.
Skewness. When displayed graphically, some unimodal distributions have many more observations on one side of the graph than on the other. The skew is named after the direction of the long tail: distributions with most of their observations
on the left (toward lower values) have a long tail to the right and are said to be skewed right; distributions with most of their
observations on the right (toward higher values) have a long tail to the left and are said to be skewed left.
Uniform. When the observations in a set of data are equally spread across the range of the
distribution, the distribution is called a uniform distribution. A uniform distribution has no clear
peak(s).
Gaps. Gaps refer to areas of a distribution where there are no observations. The second last figure
on the next slide has a gap; there are no observations in that part of the distribution.
Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the
other observations. These extreme values are called outliers.
[Figure: Different shapes of histogram. Six panels, each plotting f(x_i) against x_i: a uniform distribution, a skewed distribution, a normal distribution, a bi-modal distribution, a distribution with outliers, and a clip-like distribution.]
Cumulative Histogram
Just as a histogram pairs with a frequency table, the cumulative histogram is a visual version of the
cumulative frequency table. It shows how many (or what percentage) of the observations have accumulated
up to each bin (or interval). It makes it easier to find the percentage or proportion of observations falling within
a given interval. An ordinary and a cumulative histogram of the same data are
given in the following figures.
[Figure: two panels with x-axis “Data values” (360 to 440). Left: “Histogram of Data”, y-axis “Frequency”, with bars labelled 2, 3, 5, 7, 5, 4, 3, 1. Right: “Cumulative Histogram of Data”, y-axis “Cumulative Frequency”, with bars labelled 2, 5, 10, 17, 22, 26, 29, 30.]
The cumulative histogram embodies the concept that most probability distributions use to calculate the probabilities associated with different events, so learning about it and understanding it is a must.
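Finding the number of observations within an interval from cumulative frequencies amounts to a subtraction of two “less than” CFs. The sketch below uses the bulb data of Example 2; the function name `count_between` is hypothetical, and it only works when both end points are class boundaries.

```python
from itertools import accumulate

boundaries = [359.5, 369.5, 379.5, 389.5, 399.5, 409.5, 419.5, 429.5, 439.5]
f = [2, 3, 5, 7, 5, 4, 3, 1]   # class frequencies from Example 2
n = sum(f)

# "Less than" CF at every class boundary (0 below the first boundary).
cf = dict(zip(boundaries, [0] + list(accumulate(f))))


def count_between(low, high):
    """Number of observations between two class boundaries."""
    return cf[high] - cf[low]


k = count_between(379.5, 409.5)
print(k, "observations, proportion", round(k / n, 2))
```

Here 22 observations lie below 409.5 and 5 lie below 379.5, so 22 − 5 = 17 observations (about 57%) fall between the two boundaries.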
Dotplots
A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are
relatively few distinct data values, especially discrete values. Each observation is represented by a dot
above the corresponding location on a horizontal measurement scale. When a value occurs more than
once, there is a dot for each occurrence, and these dots are stacked vertically. As with a stem-and-leaf
display, a dotplot gives information about location, spread, extremes, and gaps.
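A dotplot can be sketched in plain text by stacking one symbol per occurrence above each value. The sketch below is an illustration only: it reuses the assignment marks from Example 1 (a small data set with few distinct values), assumes integer data, and the function name `dotplot` is hypothetical.

```python
from collections import Counter

# Marks from Example 1, reused here purely for illustration.
marks = [6, 7, 5, 7, 7, 8, 7, 6, 9, 7, 4, 10, 6, 8, 8, 9, 5, 6, 4, 8]


def dotplot(values):
    """Return a text dotplot: one 'o' per occurrence, stacked vertically."""
    counts = Counter(values)
    scale = range(min(values), max(values) + 1)   # horizontal measurement scale
    rows = []
    # Build the plot from the top row down to the first row of dots.
    for level in range(max(counts.values()), 0, -1):
        cells = ("o" if counts[v] >= level else " " for v in scale)
        rows.append("".join(c.rjust(3) for c in cells).rstrip())
    # Axis labels go underneath the dots.
    rows.append("".join(str(v).rjust(3) for v in scale))
    return "\n".join(rows)


print(dotplot(marks))
```

The tallest stack sits above 7, and the stacks thin out toward 4 and 10, which is the same picture the frequency table of Example 1 gives.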
Example 7.
The study included 33 students whose first-grade IQ scores are given here:
The following figure shows a dotplot for the above data. A representative IQ value is around 110, and
the data is fairly symmetric about the center.
Stem and Leaf Displays
A stem-and-leaf plot (aka stemplot) of a quantitative variable is a textual graph that classifies
data items according to their most significant numeric digits. It is generally used for small data sets
(50 or fewer observations). A stem-and-leaf display is similar to a histogram, since it shows how
many values in a set fall within a certain interval. It carries even more information, because it shows the actual
values within the interval.
A stem consists of the leading digit(s) of an observation, whereas the remaining digits are the leaves. For example,
the observation 327 can be split as stem=3 and leaf=27, or as stem=32 and leaf=7. The stemplot is
drawn with two columns separated by a vertical line with stems listed to the left of the vertical line.
Each stem is listed only once and no numbers are skipped, even if it has no leaves. The leaves are
listed in increasing order in a row to the right of each stem.
When there is a repeated number in the data (such as two 72s) then the plot must reflect such (e.g.
the plot of 72 72 75 76 would look like 7 | 2 2 5 6.)
Example 8.
The stem-and-leaf plot of the energy saver bulb data is constructed as below.
Stem | Leaves
36   | 3 9
37   | 1 2 7
38   | 1 2 6 7 9
39   | 0 1 2 3 4 6 9
40   | 0 1 5 8 9
41   | 0 1 5 9
42   | 2 3 8
43   | 1
(Key: 40|8 = 408)
In this example we could also use a single-digit stem, but then there would be only two stems,
3 and 4, resulting in a much less informative plot. In the case of values with decimal points (continuous
data), the decimal part of each number is taken as the leaf. Rounding may be used to suppress a certain
number of decimal places so that all data values have the same number of decimal places.
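The construction of a stem-and-leaf display can be sketched as follows. This is a minimal illustration for whole-number data; the function name `stem_and_leaf` and the parameter `leaf_unit` are hypothetical (with `leaf_unit=10` the last digit is the leaf and the leading digits form the stem).

```python
def stem_and_leaf(values, leaf_unit=10):
    """Return the rows of a stem-and-leaf display for whole-number data."""
    stems = {}
    for x in sorted(values):                # sorted, so leaves come out in order
        stem, leaf = divmod(x, leaf_unit)   # e.g. 408 -> stem 40, leaf 8
        stems.setdefault(stem, []).append(leaf)
    rows = []
    # List every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


# Bulb lifetimes (days) from Example 2.
bulb_life = [423, 392, 399, 369, 408, 415, 387, 431, 428, 411,
             401, 422, 393, 363, 396, 394, 391, 372, 371, 405,
             410, 377, 382, 419, 389, 400, 386, 409, 381, 390]

for row in stem_and_leaf(bulb_life):
    print(row)
```

Repeated values are kept as repeated leaves, so `stem_and_leaf([72, 72, 75, 76])` gives the single row `7 | 2 2 5 6`.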
Further reading and exercises:
Have a look at the introduction and Section 1.2 of Devore’s book and the examples therein.
Then solve questions 10, 11, 12, 13, 14, 15, 16.a, 16.b, 17, 20, 24, 25, 29 in Exercise 1.2.