Data Analysis: Descriptive Statistics

• “The government is very keen on
amassing statistics. They will collect
them, raise them to the nth power, take the
cube root, and prepare wonderful
diagrams. But you must never forget that
every one of these figures comes in the
first instance from the village watchman,
who just puts down what he pleases.”
Sir Josiah Stamp
Commissioner of Inland Revenue
(1896-1919)
Statistics
• Science of collecting, describing and
interpreting data
• Types
– Descriptive
– Inferential
Descriptive Statistics
• Techniques that allow you to organize and
summarize data. Examples include graphs,
percentages and averages
– Includes the collection, presentation and
description of sample data
– Descriptive statistics come in the form of charts,
tables and graphs
Inferential
• Techniques that allow you to offer conclusions about
your data
– Use sampling techniques, experimental designs, and
statistical tests to make inferences about your data
– Use observations to:
• Generalize from the sample to the population
• Perform hypothesis testing
• Determine relationships among variables
• Make predictions
– Inferential statistics allow you to infer properties of an
entire group (population) of individuals from a small number
of those individuals (sample)
Definitions
• Response variable
– A characteristic of interest about each individual
element of a population or sample
– This is the characteristic being measured. If you want
the income of all teachers in Mankato, your variable is
income
• Data
– The set of values collected for the variable from each
of the elements belonging to the sample. We could
ask 10 teachers (our sample) their income (variable)
and the 10 responses would be our data
Scales of measurement
• Nominal data (naming data)
– Classifies data into mutually exclusive (non-overlapping),
exhaustive categories in which no order or rank can be
imposed on the data
– No logical ordering of categories
– Categories are qualitative in nature
– Examples: gender; religion; eye color; marital status
Cont’d
• Ordinal (rank order data)
– Classifies data into categories that can be ranked;
however, precise differences between ranks do not
exist
– Differences in amount of measured characteristic are
discernible and numbers are assigned according to
that amount
– Properties of ordinal data:
• Data are mutually exclusive
• Data categories have some logical order
• E.g. Results of a 400m race: 1st, 2nd, 3rd
Cont’d
• Discrete Data
– A quantitative variable whose set of possible
values is countable
– Consists of data that are whole numbers and
have no decimal places
– Often thought of as counting data
• Number of people in a lecture theatre
• Number of lecture halls on MSU campus
• Number of people who agree with a particular
statement
Cont’d
• Continuous Data
– A quantitative variable that can take any real value within a range
• Height
• Weight
• Income
Organizing and Displaying data
• The purpose of displaying data using graphics is
to summarize raw data into an easy to read and
presentable form.
• From such graphs conclusions about the data
can often be drawn without further analysis
• Graphic presentation
– Qualitative data
• Bar Chart
• Pie chart
– Quantitative data
• Frequency distribution and histogram
Bar Chart
Year    Cigarettes
1900    54
1910    151
1920    665
1930    1485
1940    1976
1950    3522
1960    4171
1970    3985
1980    3851
1990    2828
Cont’d
Year    Male    Female
1900    30      25
1910    80      71
1920    380     290
1930    825     675
1940    1100    880
1950    2000    1600
1960    2300    1900
1970    2213    1900
1980    2200    1800
1990    1600    1300
Cont’d
Pie chart
Frequency distribution
• A listing that pairs each value of a variable
with its frequency
• They can be classified into two types:
– Ungrouped
• Each value of variable in the distribution stands
alone
– Grouped
• Values are grouped into a set of classes (intervals)
Ungrouped
• Ungrouped because for each value of x (0 to 5)
we have the number of times (f, its frequency)
that it appears in the data
X (variable)    F (frequency)
0               3
1               5
2               8
3               4
4               2
5               1
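As a rough sketch, an ungrouped frequency distribution can be produced by counting how often each value occurs. The slides do not give the raw observations, so the list below is an assumption: it is rebuilt by expanding the table above, and the variable names are illustrative only.

```python
from collections import Counter

# Frequencies copied from the table: value x -> frequency f
table = {0: 3, 1: 5, 2: 8, 3: 4, 4: 2, 5: 1}

# Assumption: a raw data set consistent with the table (not given in the slides)
observations = [x for x, f in table.items() for _ in range(f)]

# Count how many times each value appears in the data
frequency = Counter(observations)

for x in sorted(frequency):
    print(x, frequency[x])

# Total number of observations
total = sum(frequency.values())
print("total:", total)
```

Counting the expanded list simply recovers the table, which is the point: the table is a compact summary of the raw data.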
Grouped
Class No.   Class limits   Frequency   Midpoint
1           50<=x<60       2           55
2           60<=x<70       3           65
3           70<=x<80       8           75
4           80<=x<90       5           85
5           90<=x<100      3           95
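The midpoint column can be checked with a short sketch. It assumes each class is read as lower <= x < upper, as in the class limits above; the helper name class_number is hypothetical and only for illustration.

```python
# Class limits from the table, read as lower <= x < upper
classes = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

# Midpoint of each class: halfway between its lower and upper limit
midpoints = [(lower + upper) / 2 for lower, upper in classes]


def class_number(x):
    """Return the 1-based class number containing x, or None if x fits no class."""
    for number, (lower, upper) in enumerate(classes, start=1):
        if lower <= x < upper:
            return number
    return None


print(midpoints)
print(class_number(70))   # a value on a boundary belongs to the class it opens
print(class_number(100))  # outside every class
```

Because the upper limit is excluded, each value falls into exactly one class, which is what makes the classes mutually exclusive.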
Cont’d
• When constructing grouped frequency
distributions, the following points should be
borne in mind
– Each class should be of the same width
– The classes should be exclusive and exhaustive
– Open-ended classes should be avoided
– The number of classes should ideally be between 5
and 15
– To graph grouped frequency distributions we often
use histograms
– The bars of a histogram should touch, because the
classes are adjacent intervals with no gaps between them
Cont’d
• Relative frequency
– Frequency of a class divided by the total frequency
• Cumulative frequency
– Running total of the class frequencies, accumulated
as you go down the class intervals
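A minimal sketch of both quantities, using the class frequencies from the grouped table (2, 3, 8, 5, 3). The variable names are illustrative.

```python
from itertools import accumulate

# Class frequencies from the grouped frequency table
frequencies = [2, 3, 8, 5, 3]
total = sum(frequencies)

# Relative frequency: each class frequency divided by the total frequency
relative = [f / total for f in frequencies]

# Cumulative frequency: running total going down the classes
cumulative = list(accumulate(frequencies))

for f, r, c in zip(frequencies, relative, cumulative):
    print(f, round(r, 3), c)
```

The relative frequencies sum to 1, and the last cumulative frequency equals the total number of observations.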
Measures of Central Tendency
• The most commonly used characteristic of
a set of data is its center or the point about
which many of the observations are
clustered
• The most common ways of measuring central
tendency are:
– Mean
– Median
– Mode
• The range, covered after these, measures
dispersion rather than central tendency
Mean
• The arithmetic mean (or the average or
simply mean) is computed by summing all
numbers and dividing by the number of
observations
• The mean uses all the observations and
each observation affects the mean
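As a small illustration, the mean can be computed directly from its definition. The six values are borrowed from the median example in these slides; the variable names are illustrative.

```python
# Six observations (the same values used in the median example)
data = [74, 66, 69, 68, 73, 70]

# Sum all the numbers and divide by the number of observations
mean = sum(data) / len(data)

print(mean)  # 70.0
```

Changing any single observation changes the sum, and therefore the mean, which is why the mean is sensitive to extreme values.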
Median
• The median is the middle value in an ordered array
of observations
• If there is an even number of data in the array, the
median is the average of the two middle numbers
• If there is an odd number of data in the array, the
median is the middle number
• For example, suppose you want to find the median
for the following set of data:
• 74, 66, 69, 68, 73, 70
• First we arrange the data in an ordered array:
• 66, 68, 69, 70, 73, 74
Cont’d
• Since there is an even number of data, the average of
the middle two numbers (i.e. 69 and 70) is the median
(139/2=69.5)
• Generally the median provides a better measure of
location than the mean when there are extremely large
or small observations (i.e., when the data are skewed to
the right or to the left)
• If the median is less than the mean, the data set is
typically skewed to the right
• If the median is greater than the mean, the data set is
typically skewed to the left
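The median rule can be sketched as follows for the six example values 74, 66, 69, 68, 73, 70. Sorting gives 66, 68, 69, 70, 73, 74, so the two middle values are 69 and 70. The standard-library function statistics.median is used only as a cross-check.

```python
import statistics

data = [74, 66, 69, 68, 73, 70]

# Arrange the data in an ordered array
ordered = sorted(data)
n = len(ordered)
mid = n // 2

if n % 2 == 0:
    # Even number of values: average the two middle numbers
    median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # Odd number of values: take the middle number
    median = ordered[mid]

print(ordered)
print(median)                   # 69.5
print(statistics.median(data))  # same result from the standard library
```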
Mode
• The mode is the most frequently
occurring value in a set of observations
• Put simply, it is the most frequently
occurring data value
• For example, given 2, 3, 4, 5, 4, the mode is 4
because there are more fours than any other
number—unimodal
• Data may have two modes—bimodal
• Observations with more than two modes are
referred to as multimodal
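A brief sketch using Python's statistics.multimode (available from Python 3.8), which returns every value tied for the highest frequency. The first list is the example above; the second list is hypothetical data added only to show a bimodal case.

```python
from statistics import multimode

# Example from the slide: 4 occurs more often than any other value
unimodal = [2, 3, 4, 5, 4]
print(multimode(unimodal))  # [4]

# Hypothetical data (not from the slides) with two modes
bimodal = [1, 1, 2, 3, 3]
modes = multimode(bimodal)
print(modes)  # [1, 3]
```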
Range
• The range is the simplest measure of
dispersion
• The range can be thought of in two ways:
– As a quantity: the difference between the
highest and lowest scores in a distribution
– As an interval: the lowest and highest scores
may be reported as the range
Cont’d
Sample 1
97 98 99 100 101 102 103
Sample 2
49 50 51 100 149 150 151
Sample 3
1 2 3 100 197 198 199
Cont’d
• Range for sample 1: Either (97, 103) or 6
• Range for sample 2: Either (49, 151) or 102
• Range for Sample 3: Either (1, 199) or 198
• The samples clearly differ from one
another in the way the data are spread
• The range is susceptible to extreme values; it
only uses two values in your data for calculation
Cont’d
• The range does not include all of the
observations
• Only the two most extreme values are
included and these two numbers may be
untypical observations
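The two ways of reporting the range can be sketched for the three samples as follows. The function name data_range is hypothetical.

```python
samples = {
    "Sample 1": [97, 98, 99, 100, 101, 102, 103],
    "Sample 2": [49, 50, 51, 100, 149, 150, 151],
    "Sample 3": [1, 2, 3, 100, 197, 198, 199],
}


def data_range(values):
    """Return the range as an interval (lowest, highest) and as a quantity."""
    low, high = min(values), max(values)
    return (low, high), high - low


for name, values in samples.items():
    interval, width = data_range(values)
    print(name, interval, width)
```

Only the minimum and maximum enter the calculation, so the values in between have no effect on the result.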
Quartiles
• Quartiles divide the sorted data into quarters.
Hence, for the first quartile (Q1) 25% of the
data is below it and 75% above it
• The second quartile (Q2-this is also the
median) has 50% of the data below it and
50% above it
• Finally, 75% of the observations are below
Q3 while 25% are above
Calculating IQR
• Interquartile range (IQR)
– Upper quartile minus the lower quartile
• Sort (rank) the data and find the median
(which is the middle value—the 50%
position)
• This effectively splits your data into two
groups—below median and above median
• Next we simply find the median of these two
groups—this gives us the value at the 25%
position and the 75% position
Cont’d
Sample 1
97 98 99 100 101 102 103
Sample 2
49 50 51 100 149 150 151
Sample 3
1 2 3 100 197 198 199
Cont’d
• Interquartile range for sample 1:
• The median is the 4th largest observation
which is 100
• There are three data points below our median
(97, 98, 99)
• The median of these values is 98
• There are three data points above our
median (101, 102, 103)
• The median of these values is 102
• Hence, our interquartile range is 102-98=4
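A minimal sketch of the median-of-halves procedure described above, assuming the overall median is left out of both halves when the number of observations is odd. The function name iqr is hypothetical, and statistical libraries often use other quartile conventions, so their results may differ slightly.

```python
from statistics import median


def iqr(values):
    """Interquartile range by the median-of-halves method (needs at least 2 values)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median
    upper = ordered[half + n % 2:]   # values above the median (skips the middle value if n is odd)
    q1 = median(lower)               # value at the 25% position
    q3 = median(upper)               # value at the 75% position
    return q3 - q1


sample_1 = [97, 98, 99, 100, 101, 102, 103]
sample_2 = [49, 50, 51, 100, 149, 150, 151]
sample_3 = [1, 2, 3, 100, 197, 198, 199]

print(iqr(sample_1))  # 102 - 98 = 4
print(iqr(sample_2))  # 150 - 50 = 100
print(iqr(sample_3))  # 198 - 2 = 196
```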
Variance
• Variance is the average of the squared
deviations from the arithmetic mean
• The following steps are used to calculate the
variance
– Find the arithmetic mean
– Find the difference between each observation and
the mean
– Square these differences
– Sum the squared differences
– Since the data is a sample, divide this sum (from
the previous step) by the number of observations
minus one
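The steps can be followed one by one in a short sketch. Sample 1 from the range slides is used as the data; the variable names are illustrative, and statistics.variance from the standard library serves only as a cross-check.

```python
import statistics

sample = [97, 98, 99, 100, 101, 102, 103]

# Step 1: find the arithmetic mean
mean = sum(sample) / len(sample)

# Step 2: find the difference between each observation and the mean
deviations = [x - mean for x in sample]

# Step 3: square these differences
squared = [d ** 2 for d in deviations]

# Step 4: sum the squared differences
sum_of_squares = sum(squared)

# Step 5: the data is a sample, so divide by the number of observations minus one
variance = sum_of_squares / (len(sample) - 1)

print(mean)            # 100.0
print(sum_of_squares)  # 28.0
print(variance)
print(statistics.variance(sample))  # standard-library sample variance
```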