Continuous variables

DISTRIBUTIONS
What is a “distribution”?
An arrangement of cases in a sample or population according to their values or scores on one or more variables
One distribution for a continuous variable. Each youth homicide is a case. There is one variable: the number of youth homicide victims each month.
Two distributions, each for a single continuous variable: violent crimes and commitments to prison.
Each violent crime is a case. The variable is the number of violent crimes each year per 100,000 population.
Each commitment to prison is a case. The variable is the number of commitments each year per 100,000 population.
Distributions can be depicted visually. How that is done depends on how many variables there are and on their type (categorical or continuous).
(A case is a single unit that “contains” all the variables of interest.)
One distribution for TWO categorical variables:
Youth’s demeanor (two categories)
Officer’s disposition (four categories)
Each police encounter with a youth is a case.
DEPICTING THE DISTRIBUTION OF
CATEGORICAL VARIABLES
Depicting distribution of a categorical variable: the bar graph
Distributions depict the frequency (number of cases) at each value of a variable. Here there is one variable with two values: gender (M/F).
Frequency means the number of cases (students) at a single value of a variable. Frequencies are always on the Y axis.
Values of the variable are always on the X axis.
Bars are “made up of” cases. Here that means students, arranged by the variable gender.
A case is a single unit that “contains” all the variables of interest. Here each student is a case.
[Figure: bar graph of gender. Y axis: how many at each value/score. X axis: value or score of variable. N = 32; bars of n = 17 and n = 15.]
Distributions illustrate how cases cluster or spread out according to the value or score of the variable. Here the proportions of men and women seem about equal.
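A frequency distribution for a categorical variable can be tallied in a few lines. The sketch below is illustrative only: the slide gives N = 32 with bars of 17 and 15 but does not say which gender has which count, so assigning 17 to "F" and 15 to "M" is an assumption, and the names `students` and `frequencies` are hypothetical.

```python
from collections import Counter

# One entry per case (student); the value is that student's gender.
# ASSUMPTION: the slide does not say which gender has 17 and which has 15.
students = ["F"] * 17 + ["M"] * 15

# Frequency = number of cases at each value of the variable.
frequencies = Counter(students)
total_cases = sum(frequencies.values())  # N

# Text bar graph: each "#" stands for one case.
for value in sorted(frequencies):
    print(f"{value} | {'#' * frequencies[value]} (n={frequencies[value]})")
print(f"N = {total_cases}")
```

The bars are printed sideways for simplicity, so the values run down the page rather than along the X axis as they would in a drawn graph.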
Using a table to display the distribution of two categorical variables
[Figure: table with the values or scores of one variable along each side (one side is labeled “Officer’s disposition”). The “cells” hold the number of cases at each value/score.]
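The idea of table “cells” can be sketched as a cross-tabulation. Everything in the data below is hypothetical: the seven encounters are invented for illustration, and the short category labels are shorthand for the demeanor and disposition categories named in these slides.

```python
from collections import Counter

# HYPOTHETICAL cases: each police encounter is one (demeanor, disposition) pair.
encounters = [
    ("cooperative", "admonished"),
    ("cooperative", "admonished"),
    ("cooperative", "informal"),
    ("cooperative", "citation"),
    ("uncooperative", "informal"),
    ("uncooperative", "arrested"),
    ("uncooperative", "arrested"),
]

demeanors = ["cooperative", "uncooperative"]
dispositions = ["admonished", "informal", "citation", "arrested"]

# Each cell holds the number of cases with that combination of values.
cells = Counter(encounters)

# Print the table: one row per demeanor, one column per disposition.
print(f"{'':14}" + "".join(f"{d:>12}" for d in dispositions))
for dem in demeanors:
    row = "".join(f"{cells[(dem, d)]:>12}" for d in dispositions)
    print(f"{dem:14}{row}")
```

A combination that never occurs simply shows a frequency of 0 in its cell.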
DEPICTING THE DISTRIBUTION OF
CONTINUOUS VARIABLES
Depicting the distribution of continuous variables: the histogram
Distributions depict the frequency (number of cases) at each value of a variable. Here there is one variable: age, measured on a scale of 20-33.
Values of the variable are always on the X axis.
Frequency means the number of cases (students) at a single value of a variable. Frequencies are always on the Y axis.
A case is a single unit that “contains” all the variables of interest. Here each student is a case.
What is the area under the trend line “made up of”? Cases, meaning students (arranged by age).
[Figure: histogram of age with trend line. Y axis: how many at each value/score. X axis: value or score of variable.]
Sometimes, bar graphs are used for continuous variables
What are the bars “made of”? Cases, meaning homicides (arranged by the variable homicides per year).
[Figure: bar graph of homicides per year. Y axis: how many at each value/score. X axis: value or score of variable.]
Continuous variables: What “makes up” the areas under the trend lines?
Cases, that’s what!
[Figure: three graphs with trend lines. Y axis: how many at each value/score. X axis: value or score of variable.]
Each murdered youth is one “case”. Variable: # youths murdered each month.
Each violent crime is one “case”. Variable: # crimes per 100,000 population each year.
Each commitment to prison is one “case”. Variable: # commitments to prison, per 100,000 population, each year.
Summarizing the distribution of
CATEGORICAL VARIABLES
Summarizing the distribution of categorical variables using percentage
• Instead of using graphs or a lot of words, is there a single statistic that can convey what a distribution “looks like”?
• Percentage is a “statistic.” It’s a proportion with a denominator of 100.
• Percentages are used to summarize categorical data
  – 70 percent of students are employed; 60 percent of parolees recidivate
• Since per cent means per 100, any decimal can be converted to a percentage by multiplying it by 100 (moving the decimal point two places to the right)
  – .20 = .20 X 100 = 20 percent (twenty per hundred)
  – .368 = .368 X 100 = 36.8 percent (thirty-six point eight per hundred)
• When converting, remember that there can be fractions of one percent
  – .0020 = .0020 X 100 = .20 percent (two tenths of one percent)
• To obtain a percentage for a category, divide the number of cases in the category by the total number of cases in the sample, then multiply the result by 100
Exercise: 50,000 persons were asked whether crime is a serious problem: 32,700 said “yes.” What percentage said “yes”?
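The conversion rule and the category calculation can be checked with a short sketch. The function names `to_percent` and `category_percent` are hypothetical helpers introduced here, not part of the course material.

```python
def to_percent(proportion):
    """Convert a decimal proportion to a percentage (per 100)."""
    return proportion * 100

def category_percent(cases_in_category, total_cases):
    """Percentage of all cases that fall in one category."""
    return to_percent(cases_in_category / total_cases)

# Conversions from the slide (rounded to hide floating-point noise).
print(round(to_percent(0.20), 2))    # twenty per hundred
print(round(to_percent(0.368), 2))   # thirty-six point eight per hundred
print(round(to_percent(0.0020), 2))  # two tenths of one percent

# Survey exercise: 32,700 of 50,000 respondents said "yes".
yes_percent = category_percent(32_700, 50_000)
print(round(yes_percent, 1))
```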
Using percentages to compare datasets
• Percentages are “normalized” numbers (e.g., per 100), so they can be used to compare datasets of different size
  – Last year, 10,000 people were polled. Eight thousand said crime is a serious problem.
  – This year, 12,000 people were polled. Nine thousand said crime is a serious problem.
Calculate the second percentage and compare it to the first
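As a rough illustration of why normalizing matters, the two polls can be compared directly. The helper name `percent` is an assumption made for this sketch.

```python
def percent(part, whole):
    """Share of `whole` represented by `part`, expressed per 100."""
    return part / whole * 100

last_year = percent(8_000, 10_000)  # 8,000 of 10,000 polled
this_year = percent(9_000, 12_000)  # 9,000 of 12,000 polled

print(f"Last year: {last_year:.0f}%  This year: {this_year:.0f}%")

# The raw count rose (8,000 to 9,000), yet the share of respondents changed
# in the opposite direction once the larger sample is taken into account.
print("Share rose" if this_year > last_year else "Share fell or held steady")
```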
Practical exercise
Draw two bar graphs, one for each class, depicting proportions for gender
Class 1
• 15 Females: 15/31 = .483 X 100 = 48%
• 16 Males: 16/31 = .516 X 100 = 52%
• Total: 100%
Class 2
• 20 Females: 20/31 = .645 X 100 = 65%
• 11 Males: 11/31 = .354 X 100 = 35%
• Total: 100%
Calculating increases in percentage
Increases in percentage are computed off the base amount
Example: Jail with 120 prisoners. How many prisoners will there be...
– …with a 100 percent increase?
  100 percent of the base amount, 120, is 120 (120 X 100/100)
  120 base + 120 increase = 240 (2 times the base amount)
– …with a 150 percent increase?
  150 percent of 120 is 180 (120 X 150/100)
  120 base plus 180 increase = 300 (2½ times the base amount)
How many will there be with a 200 percent increase?
[Figure: Original; 100% larger = 2 times the original (2X); 200% larger = 3 times the original (3X).]
Percentage changes can mislead
• Answer to preceding slide: jail with 120 prisoners, 200 percent increase
  200 percent of 120 is 240 (120 X 200/100)
  120 base plus 240 = 360 (3 times the base amount)
• Percentages can make changes seem large when bases are small
  Example: Increase from 1 to 3 convictions is 200 (two-hundred) percent
  3 - 1 = 2
  2/base = 2/1 = 2
  2 X 100 = 200%
• Percentages can make changes seem small when bases are large
  Example: Increase from 5,000 to 6,000 convictions is 20 (twenty) percent
  6,000 - 5,000 = 1,000
  1,000/base = 1,000/5,000 = .20 = 20%
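Both directions of the calculation, applying a percentage increase to a base and working out the percentage change between two amounts, can be sketched as follows. The function names are hypothetical; the numbers are the jail and conviction examples from the slides.

```python
def apply_percent_increase(base, percent):
    """New amount after increasing `base` by `percent` percent of itself."""
    return base + base * percent / 100

def percent_change(old, new):
    """Change from `old` to `new` as a percentage of the base (`old`).

    Algebraically the same as (change / base) X 100; multiplying first
    keeps the arithmetic exact for these whole-number examples.
    """
    return (new - old) * 100 / old

# Jail example: base of 120 prisoners.
for pct in (100, 150, 200):
    print(pct, apply_percent_increase(120, pct))  # 240.0, 300.0, 360.0

# Small base: 1 to 3 convictions. Large base: 5,000 to 6,000 convictions.
print(percent_change(1, 3))          # 200.0
print(percent_change(5_000, 6_000))  # 20.0
```

The absolute change in the second case (1,000 convictions) is far larger than in the first (2 convictions), even though its percentage is ten times smaller.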
Summarizing the distribution of
CONTINUOUS VARIABLES
Four summary statistics for continuous variables
• Continuous variables – review
  – Can take on an infinite number of values (e.g., age, height, weight, sentence length)
  – Precise differences between cases
  – Equivalent differences: the distance between 15 and 20 years is the same as between 60 and 65 years
• Summary statistics for continuous variables
  – Mean: arithmetic average of scores
  – Median: midpoint of scores (half higher, half lower)
  – Mode: most frequent score (or scores, if tied)
  – Range: Difference between low and high scores
Summarizing the distribution of continuous variables - the mean
• Arithmetic average of scores
  – Add up all the scores
  – Divide the result by the number of scores
• Example: Compare numbers of arrests for twenty police precincts during a certain shift
• Method: Use mean to summarize arrests at each precinct, then compare the means
[Figure: two distributions of arrests, one with mean 3.0 and one with mean 3.5]
Variable: number of arrests
Unit of analysis: police precincts
Case: one precinct
Issue: Means are pulled in the direction of extreme scores, possibly misleading the comparison
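A minimal sketch of the calculation, using the twenty arrest counts that these slides list with the median example. The second list assumes, as the median slide states, that the two sets are identical except for one outlier (16); the function name `arithmetic_mean` is hypothetical.

```python
# Arrest counts for twenty precincts, as listed on the median slide.
arrests = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]

# Same scores, except the highest (6) is replaced by an outlier (16).
arrests_with_outlier = arrests[:-1] + [16]

def arithmetic_mean(scores):
    """Add up all the scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

print(arithmetic_mean(arrests))               # 3.0
print(arithmetic_mean(arrests_with_outlier))  # 3.5
```

Changing a single score out of twenty moves the mean by half an arrest, which is the pull toward extreme scores noted above.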
Transforming categorical/ordinal variables into continuous variables, then using the mean
• Ordinal variables are categorical variables with an inherent order
  – Small, medium, large
  – Cooperative, uncooperative
• Can summarize in the ordinary way: proportions / percentages
• Can also transform them into continuous variables by assigning categories points on a scale, then calculating a mean
• Not always recommended because “distances” between points on scale may not be equal, causing misleading results
• Is the distance between “Admonished” and “Informal” the same as between “Informal” and “Citation”? “Citation” and “Arrest”?

Value   Severity of disposition           Youths (freq.)     %
4       Arrested                                16           24
3       Citation or official reprimand           9           14
2       Informal reprimand                      16           24
1       Admonished & released                   25           38
        Total (N)                               66          100

Severity of disposition mean = 2.24
[(25 X 1) + (16 X 2) + (9 X 3) + (16 X 4)] / 66 = 148 / 66 = 2.24
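The mean on the assigned scale is a frequency-weighted average: each scale value is multiplied by the number of youths at that value, and the sum is divided by N. A short sketch using the table's figures (the variable names are hypothetical):

```python
# Scale value -> (disposition label, number of youths), from the table.
dispositions = {
    4: ("Arrested", 16),
    3: ("Citation or official reprimand", 9),
    2: ("Informal reprimand", 16),
    1: ("Admonished & released", 25),
}

# N: total number of cases (youths).
total_youths = sum(freq for _, freq in dispositions.values())

# Sum of (scale value X frequency) over all categories.
weighted_sum = sum(value * freq for value, (_, freq) in dispositions.items())

severity_mean = weighted_sum / total_youths
print(total_youths, weighted_sum, round(severity_mean, 2))  # 66 148 2.24
```

The result depends entirely on the points assigned to each category; a different scale (say 1, 2, 5, 10) would give a different mean from the same frequencies, which is why the approach is not always recommended.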
Summarizing the distribution of continuous variables - the median
• Median can be used with continuous or ordinal variables
• Median is a useful summary statistic when there are extreme scores, making the mean misleading
• In this example, which is identical to the preceding page except for one outlier (16), the mean is 3.5, which is .5 higher
• But the medians (3.0) are the same
arrests (no outlier):
0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6
Median: (3 + 3) / 2 = 3
Compute the median:
Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
• Answers to preceding slide
Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
Answer: 8
Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
Answer: 10 ((8 + 12) / 2)
arrests (one outlier):
0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 16
Median: (3 + 3) / 2 = 3
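The effect of the outlier on the mean but not the median can be verified with Python's standard `statistics` module. This is only a sketch; the variable names are hypothetical and the data are the lists given in these slides.

```python
from statistics import mean, median

arrests = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]
arrests_with_outlier = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 16]

# With an even number of scores, the median is the average of the two
# middle scores (here the 10th and 11th, both equal to 3).
for label, scores in (("no outlier", arrests), ("one outlier", arrests_with_outlier)):
    print(label, "mean:", mean(scores), "median:", median(scores))

exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]      # odd count: middle score
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]  # even count: (8 + 12) / 2
print(median(exercise_1))  # 8
print(median(exercise_2))  # 10.0
```

`median` expects the scores themselves, not a frequency table, and sorts them internally, so the input order does not matter.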
Summarizing the distribution of continuous variables - the mode
• Score that occurs most often (with the greatest frequency)
• Here the mode is 3
• Modes are a useful summary statistic when cases cluster at particular scores, an interesting condition that might otherwise be overlooked
• Symmetrical, bell-shaped distributions, like this one, are called “normal” distributions. In such distributions the mean, mode and median are the same. Near-normal distributions are common.
• There can be more than one mode (bi-modal, tri-modal, etc.). Identify the modes:
  – Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
  – Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
[Figure: distribution of arrests]
A final way to depict the distribution
of continuous variables - the range
• Answers to preceding slide
  Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
  Mode = 5 (unimodal)
  Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
  Modes = 5, 21 (bimodal)
• Range: a simple way to convey the distribution of a continuous variable
  – Depicts the lowest and highest scores in a distribution
    2, 3, 5, 5, 8, 12, 17, 19, 21: range is “2 to 21”
  – Range can also be defined as the difference between the highest and lowest scores (21-2 = 19). If so, minimum and maximum scores should also be given.
  – Useful to cite range if there are outliers (extreme scores) that misleadingly distort the shape of the distribution
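Modes and ranges for the exercise data can be checked with a brief sketch. `multimode` is part of Python's standard `statistics` module (Python 3.8 or later); `score_range` is a hypothetical helper that reports the lowest score, the highest score, and their difference together, as the slide recommends.

```python
from statistics import multimode  # available in Python 3.8+

def score_range(scores):
    """Return (lowest, highest, difference) for a list of scores."""
    low, high = min(scores), max(scores)
    return low, high, high - low

exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]

# multimode returns every score tied for the greatest frequency.
print(multimode(exercise_1))  # [5]
print(multimode(exercise_2))  # [5, 21]

print(score_range(exercise_1))  # (2, 21, 19)
```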
Practical exercise
• Calculate your class summary
statistics for age and height – mean,
median, mode and range
• Pictorially depict the distributions
for age and height, placing the
variables and frequencies on the
correct axes
Next week – Every week:
Without fail – bring an approved calculator – the
same one you will use for the exam.
It must be a basic calculator with a square root key.
NOT a scientific or graphing calculator. NOT a cell
phone, etc.
HOMEWORK
(link on weekly schedule)
1. Calculate all appropriate summary statistics for each distribution
2. Pictorially depict the distribution of arrests
3. Pictorially depict the distribution of gender

Case no.   Income   No. of arrests   Gender
 1         15600    4                M
 2         21380    3                F
 3         17220    5                F
 4         18765    2                M
 5         23220    1                F
 6         44500    0                M
 7         34255    0                F
 8         21620    0                F
 9         14890    1                M
10         16650    2                F
11         44500    1                F
12         16730    3                M
13         23980    3                F
14         14005    0                F
15         21550    2                M
16         26780    4                M
17         18050    1                F
18         34500    1                M
19         33785    3                F
20         21450    2                F
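One way to check hand calculations for the homework dataset is sketched below. It assumes that “appropriate” means mean, median, mode and range for the continuous variables (income, arrests) and percentages for the categorical variable (gender), following the earlier slides; the function and variable names are hypothetical.

```python
from collections import Counter
from statistics import mean, median, multimode

# (income, number of arrests, gender) for cases 1-20, from the table.
cases = [
    (15600, 4, "M"), (21380, 3, "F"), (17220, 5, "F"), (18765, 2, "M"),
    (23220, 1, "F"), (44500, 0, "M"), (34255, 0, "F"), (21620, 0, "F"),
    (14890, 1, "M"), (16650, 2, "F"), (44500, 1, "F"), (16730, 3, "M"),
    (23980, 3, "F"), (14005, 0, "F"), (21550, 2, "M"), (26780, 4, "M"),
    (18050, 1, "F"), (34500, 1, "M"), (33785, 3, "F"), (21450, 2, "F"),
]

def summarize_continuous(scores):
    """Mean, median, mode(s) and range (low, high) for a continuous variable."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "modes": multimode(scores),
        "range": (min(scores), max(scores)),
    }

def summarize_categorical(values):
    """Percentage of cases in each category, rounded to whole percents."""
    counts = Counter(values)
    return {value: round(n / len(values) * 100) for value, n in counts.items()}

income_summary = summarize_continuous([income for income, _, _ in cases])
arrests_summary = summarize_continuous([arrests for _, arrests, _ in cases])
gender_percent = summarize_categorical([gender for _, _, gender in cases])

print("Income: ", income_summary)
print("Arrests:", arrests_summary)
print("Gender: ", gender_percent)
```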