Math 103 Lecture 9 notes page 1
Math 103 Lecture 9 class notes
Statistics (from the Latin staticus, “out of state”) is the study of methods of collecting,
organizing, presenting, analyzing, and drawing conclusions about data, commonly in numerical form.
The three branches of statistics are: descriptive, inferential, and survey/sampling.
Descriptive Statistics: organizing, summarizing, graphing and presenting data
I. Organize data into frequency tables
a. class and frequency
b. extended table includes relative frequency, cumulative frequency, and cumulative
relative frequency, as well as class marks
II. Make charts or graphs
a. histogram and bar graphs
b. frequency curve or polygon
c. ogive
d. box & whisker or boxplot
e. circle or pie graph
f. stem & leaf
g. pictographs
h. scatter plots
i. line plots
III. Calculate measures
a. central tendency (mean, median, mode)
b. variation (range, standard deviation)
c. position (percentiles, quartiles)
I. Organize data into frequency tables
Frequency Table = an excellent device for making larger collections of data much more
intelligible. A frequency table is so named because it lists categories of scores along with their
corresponding frequencies. The frequency for a category or class is the number of original
scores that fall into that class. The columns of an extended frequency table generate various
graphs or charts. Extended frequency tables therefore become important prerequisites for
creating graphs and charts used in statistics.
Guidelines for frequency tables:
1. Class intervals should not overlap. Classes are mutually exclusive.
2. Classes should continue throughout the distribution with NO gaps. Include all classes.
3. All classes should have the same width.
4. Class widths should be “convenient” numbers.
5. Use 5-20 classes.
6. Make lower or upper limits multiples of the width.
An extended frequency table includes the following:
a. class intervals (lower and upper limits)
b. marks
c. frequency
d. cumulative frequency
e. relative frequency
f. cumulative relative frequency
Example Data Set: Dr. Brown’s Exam Scores
98 90 85 84 81 79 76 73 69 60
98 90 85 83 80 79 75 72 68 60
93 88 85 82 80 78 75 71 67 59
93 87 84 82 79 77 74 70 64 57
91 86 84 81 79 77 74 70 63 54
Note: Typically, you will have to rank data first; data does not usually come ordered!
The first thing to do with numerical data is to organize it into a frequency table. Each column of a
frequency table generates (is used to create) a particular graph or chart.
Extended Frequency Table of Dr. Brown’s Exam Scores
class | freq | cumulative freq. | relative freq. | cumulative relative freq. | mark | boundaries
50-54
55-59
60-64
65-69
70-74
75-79
80-84
85-89
90-94
95-99
100+
The width of each class is 5 (size of each class).
The lower limits are the smaller numbers of each class (50, 55, 60, 65, 70, etc.)
The upper limits are the larger numbers of each class (54, 59, 64, 69, 74, etc.)
Note: the class limits (either lower or upper) should be a multiple of the width.
The mark is the midpoint of each class.
Only the last class can be "open-ended."
There should be no "gaps" in organizing classes.
There should be no "overlap" in class numbers.
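The table above can be filled in mechanically. The following is a minimal Python sketch, assuming the 50 exam scores listed earlier and the classes 50-54 through 95-99 with width 5 (the open-ended 100+ class is empty for this data). The function name `extended_frequency_table` and the dictionary keys are illustrative choices, not part of the notes.

```python
# Dr. Brown's exam scores, copied from the example data set.
scores = [
    98, 90, 85, 84, 81, 79, 76, 73, 69, 60,
    98, 90, 85, 83, 80, 79, 75, 72, 68, 60,
    93, 88, 85, 82, 80, 78, 75, 71, 67, 59,
    93, 87, 84, 82, 79, 77, 74, 70, 64, 57,
    91, 86, 84, 81, 79, 77, 74, 70, 63, 54,
]

def extended_frequency_table(data, start=50, width=5, n_classes=10):
    """Build one row per class: limits, mark, and the four frequency columns."""
    rows = []
    cumulative = 0
    for i in range(n_classes):
        lower = start + i * width          # lower class limit
        upper = lower + width - 1          # upper class limit
        freq = sum(1 for x in data if lower <= x <= upper)
        cumulative += freq
        rows.append({
            "class": f"{lower}-{upper}",
            "mark": (lower + upper) / 2,   # midpoint of the class
            "freq": freq,
            "cum_freq": cumulative,
            "rel_freq": freq / len(data),
            "cum_rel_freq": cumulative / len(data),
        })
    return rows

table = extended_frequency_table(scores)
for row in table:
    print(f"{row['class']:>6} {row['mark']:>5} {row['freq']:>4} {row['cum_freq']:>4} "
          f"{row['rel_freq']:>5.2f} {row['cum_rel_freq']:>5.2f}")
```

The last cumulative frequency should equal the number of scores (50), and the last cumulative relative frequency should equal 1, which is a quick check that no class was skipped or double counted.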
II. Make charts or graphs
Histogram: a type of bar graph representing an entire set of data. It is helpful when you need to
discover or display the distribution of interval or ratio data. Histograms illustrate central
tendency, shape, and how the data is spread out or dispersed. A histogram is made up of the
following components:
1. a title, which identifies the population of concern
2. a vertical scale, which identifies the frequencies in the various classes
3. a horizontal scale, which identifies the variable. Values for class boundaries, class limits, or
class marks may be labeled along the axis.
Shapes of histograms: symmetrical, uniform, skewed, J-shaped, and bimodal.
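As a rough illustration of the histogram components, the Python sketch below prints a text version for Dr. Brown's exam scores, assuming classes of width 5 starting at 50. The bars run sideways here, so the class limits label each row and the number of asterisks plays the role of the frequency scale. It is only a stand-in for a drawn histogram.

```python
from collections import Counter

# Dr. Brown's exam scores, copied from the example data set.
scores = [
    98, 90, 85, 84, 81, 79, 76, 73, 69, 60,
    98, 90, 85, 83, 80, 79, 75, 72, 68, 60,
    93, 88, 85, 82, 80, 78, 75, 71, 67, 59,
    93, 87, 84, 82, 79, 77, 74, 70, 64, 57,
    91, 86, 84, 81, 79, 77, 74, 70, 63, 54,
]

width = 5
# Map each score to the lower limit of its class and count per class.
counts = Counter((x // width) * width for x in scores)

lines = []
for lower in range(50, 100, width):
    label = f"{lower}-{lower + width - 1}"
    lines.append(f"{label} | {'*' * counts[lower]}")

print("Dr. Brown's Exam Scores")   # title: identifies the population
print("\n".join(lines))            # one bar per class
```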
Frequency Curve or Polygon: the horizontal axis uses marks. The vertical axis is either frequency or
relative frequency. Several sets of data can be depicted on the same graph.
Ogive: a cumulative frequency curve, always with a typical “upward” trend.
Box-&-Whisker = a representation of the data set that splits the distribution into four groups of
25% each, often referred to as a quartile distribution. Several sets of data can be pictured side-by-side
using box-&-whisker plots, making data comparisons easier for the reader. The “key” points are:
1. 0% (or 10%)
2. 25%
3. 50%
4. 75%
5. 100% (or 90%)
III. Calculate Measures
AVERAGES:
Mode = the data value that occurs most frequently.
Ex: 6 7 8 9 9 10
Another ex: 6 3 2 3 3 5 3 2
If you cannot identify the ONE value that occurs most frequently, the data set has no mode.
Ex: 3 3 4 5 5 7
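The three mode examples can be checked with a short Python sketch. It follows the convention stated in these notes (a tie for most frequent means "no mode"). Other textbooks call tied data sets bimodal instead, so treat the `None` result as this course's convention. The name `single_mode` is illustrative.

```python
from collections import Counter

def single_mode(data):
    """Return the one most frequent value, or None if there is a tie."""
    counts = Counter(data)
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None

print(single_mode([6, 7, 8, 9, 9, 10]))        # 9
print(single_mode([6, 3, 2, 3, 3, 5, 3, 2]))   # 3
print(single_mode([3, 3, 4, 5, 5, 7]))         # None (3 and 5 tie)
```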
Median = middle score in ranked data.
Ex: 3 4 6 8 9 11 15 27 31
When there is an even number of data values, the median is halfway between the middle scores.
Ex: 3 5 6 7 9 10 10 12
The median need not be a member of the data set.
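A minimal Python sketch of the median rule, applied to the two examples above. The data is ranked first, then the middle score (odd count) or the value halfway between the two middle scores (even count) is returned.

```python
def median(data):
    """Middle score of the ranked data; halfway between the two middle scores if even."""
    ranked = sorted(data)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

print(median([3, 4, 6, 8, 9, 11, 15, 27, 31]))   # 9
print(median([3, 5, 6, 7, 9, 10, 10, 12]))       # 8.0
```

In the second example the median 8 is not one of the data values, which illustrates the last remark above.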
Midrange = the value halfway between the highest and lowest data value.
Ex: 6 7 8 9 9 10
The midrange need not be a member of the data set.
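The midrange is a one-line calculation. A small Python sketch for the example above (the function name `midrange` is illustrative):

```python
def midrange(data):
    """Value halfway between the lowest and highest data values."""
    return (min(data) + max(data)) / 2

print(midrange([6, 7, 8, 9, 9, 10]))   # 8.0
```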
Midhinge = value halfway between the left hinge and right hinge of a box-&-whisker plot.
The midhinge need not be a member of the data set.
Mean = the value which is the sum of all data values divided by the number of pieces of data.
Ex: 6 3 8 5 3
Mean = (6 + 3 + 8 + 5 + 3)/5 = 5
Ex: 85 76 93 82 96
Mean = (85 + 76 + 93 + 82 + 96)/5 = 86.4
The mean need not be a member of the data set.
The mean is the most common measure of central tendency and is the statistic usually denoted
by the word “average.” The mean is the “balance point” of a distribution: the sum of the
distances from the mean to the data values on its right equals the sum of the distances to those on its left.
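The "balance point" property can be verified numerically. The Python sketch below computes the mean of the two example data sets and compares the total distance of the values above the mean with the total distance of the values below it. The helper name `balance` is illustrative.

```python
def mean(data):
    """Sum of all data values divided by the number of pieces of data."""
    return sum(data) / len(data)

def balance(data):
    """Return (total distance right of the mean, total distance left of the mean)."""
    m = mean(data)
    right = sum(x - m for x in data if x > m)
    left = sum(m - x for x in data if x < m)
    return right, left

for data in ([6, 3, 8, 5, 3], [85, 76, 93, 82, 96]):
    right, left = balance(data)
    print(mean(data), round(right, 2), round(left, 2))
```

For the first data set the mean is 5 and both distances total 4; for the second the mean is 86.4 and both total 16.2.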
Ex: There is a salary dispute between management and labor at Castellon Manufacturing. The labor
union claims that the average salary is only $3000/year. Management says the average salary is
$7300. You have been called in as a federal mediator. The first thing you need to do is to figure out
the average salary. Suppose there are only 10 employees and you can get their monthly salaries
from payroll. They are:
$3000, $3000, $3000, $3500, $4000, $4500, $6000, $6000, $15000, and $25000
Does the Union’s claim of $3000 seem like the “average”?
Does the Management’s claim of $7300 seem like the “average”?
Weighted Mean = Suppose one class of 20 students averaged 80% on a test, while another class of
30 students averaged 74%. What is the average for the combined group of students?
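One way to work the combined-class question is to weight each class average by its class size. A minimal Python sketch (the function name `weighted_mean` is illustrative):

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)

# Class averages (percent) weighted by the number of students in each class.
combined = weighted_mean([80, 74], [20, 30])
print(combined)   # (20*80 + 30*74) / 50 = 3820 / 50
```

The result, 76.4%, is closer to 74% than to 80% because the larger class pulls the combined average toward its own. Simply averaging 80 and 74 would give 77, which ignores the class sizes.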
DISPERSION OR VARIATION
Range = the difference between the highest and lowest data values.
Variance, σ^2 = the sum of squared deviations from the mean divided by the number of data points = Σ(x – µ)^2 / n
Standard Deviation, s = √variance = √[Σ(x – µ)^2 / n] for a population, or √[Σ(x – µ)^2 / (n – 1)] for a sample
Note: for most distributions (especially bell-shaped ones), virtually the entire spread (range) of the data is about 6 standard deviations.
Standard deviation is usually rounded to 1-2 decimal places.
Ex: data: 1 3 5 6 6 9
s =
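The example can be worked in a few lines of Python. Because the notes give both the n and the n – 1 divisor, the sketch below computes both; which one applies depends on whether the six values are treated as a whole population or as a sample. The `sample` flag is an illustrative parameter, not notation from the notes.

```python
import math

def variance(data, sample=False):
    """Sum of squared deviations from the mean, divided by n (or n - 1 for a sample)."""
    n = len(data)
    m = sum(data) / n
    squared_deviations = sum((x - m) ** 2 for x in data)
    return squared_deviations / (n - 1 if sample else n)

def std_dev(data, sample=False):
    """Square root of the variance."""
    return math.sqrt(variance(data, sample))

data = [1, 3, 5, 6, 6, 9]
print(round(std_dev(data), 2))                # 2.52  (divide by n)
print(round(std_dev(data, sample=True), 2))   # 2.76  (divide by n - 1)
```

Here the mean is 5 and the squared deviations are 16, 4, 0, 1, 1, 16, which sum to 38.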
POSITION
Quartiles = numbers that divide ranked data into fourths. A data set has 3 quartiles.
1st Quartile = a number such that at most 1/4 of the data are smaller in value, and at most 3/4
are larger.
2nd Quartile = median
3rd Quartile = a number such that at most 3/4 of the data are smaller in value, and at most 1/4
are larger.
Percentiles = numbers that divide ranked data into 100 parts. A data set has 99 percentiles.
Deciles = numbers that divide ranked data into 10 parts. A data set has 9 deciles.
Here’s an example using a small data set, which contains an odd number of values:
35 47 48 50 51 53 54 70 75
Split the data in half at the median (51, which is counted in both halves), then find the median of each half: Q1 = 48 and Q3 = 54.
Interquartile range: IQR = Q3 – Q1 = 54 – 48 = 6
Here’s an example using a small data set, which contains an even number of values:
35 47 48 50 51 53 54 60 70 75
Split the data in half at the median (52, halfway between 51 and 53), then find the median of each half: Q1 = 48 and Q3 = 60.
Interquartile range: IQR = Q3 – Q1 = 60 – 48 = 12
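The "median of each half" procedure can be sketched in Python as follows. One assumption is made explicit here: when the number of values is odd, the median is counted in both halves. That reading is inferred from the IQR of 6 stated for the nine-value example; other textbooks and software use different quartile conventions and can give slightly different answers. The function names are illustrative.

```python
def median(ranked):
    """Median of an already ranked list."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-each-half method."""
    ranked = sorted(data)
    n = len(ranked)
    half = (n + 1) // 2            # odd n: the median belongs to both halves (assumption)
    lower = ranked[:half]
    upper = ranked[n - half:]
    return median(lower), median(ranked), median(upper)

for data in ([35, 47, 48, 50, 51, 53, 54, 70, 75],
             [35, 47, 48, 50, 51, 53, 54, 60, 70, 75]):
    q1, q2, q3 = quartiles(data)
    print(q1, q2, q3, "IQR =", q3 - q1)
```

For the two data sets above this gives IQR values of 6 and 12, matching the worked results.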