DISTRIBUTIONS

What is a “distribution”?
An arrangement of cases in a sample or population according to their values or scores on one or more variables. (A case is a single unit that “contains” all the variables of interest.)
• One distribution for a continuous variable. Each youth homicide is a case. There is one variable: the number of youth homicide victims each month.
• Two distributions, each for a single continuous variable: violent crimes and commitments to prison.
  – Each violent crime is a case. The variable is the number of violent crimes each year per 100,000 population.
  – Each commitment to prison is a case. The variable is the number of commitments each year per 100,000 population.
• One distribution for TWO categorical variables: youth’s demeanor (two categories) and officer’s disposition (four categories). Each police encounter with a youth is a case.
Distributions can be depicted visually. How that is done depends on the number of variables and their type (categorical or continuous).

DEPICTING THE DISTRIBUTION OF CATEGORICAL VARIABLES

Depicting the distribution of a categorical variable: the bar graph
• Distributions depict the frequency (number of cases) at each value of a variable. Here there is one variable with two values: gender (M/F).
• Frequency means the number of cases (here, students) at a single value of a variable. Frequencies are always on the Y axis.
• Values of the variable are always on the X axis.
• Bars are “made up of” cases. Here that means students, arranged by the variable gender.
• A case is a single unit that “contains” all the variables of interest. Here each student is a case.
[Bar graph: gender on the X axis (value or score of variable), number of students on the Y axis (how many at each value/score); N = 32, with bars of n = 17 and n = 15]
• Distributions illustrate how cases cluster or spread out according to the value or score of the variable. Here the proportions of men and women seem about equal.
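As a rough illustration of what a frequency distribution is, the Python sketch below counts the cases at each value of a categorical variable and prints a text version of a bar graph. The 17/15 split comes from the slide, but which count belongs to which gender is an assumption made here for illustration, as are the function names `frequency_distribution` and `text_bar_graph`.

```python
from collections import Counter


def frequency_distribution(cases):
    """Return the number of cases at each value of a categorical variable."""
    return Counter(cases)


def text_bar_graph(freq):
    """Build one line per value: the value, a bar of '#' (one per case), and n."""
    lines = []
    for value in sorted(freq):
        lines.append(f"{value} {'#' * freq[value]} n={freq[value]}")
    return lines


# Hypothetical class of N = 32 students; each student is one case.
# Assumption: 17 are coded "M" and 15 are coded "F".
students = ["M"] * 17 + ["F"] * 15

freq = frequency_distribution(students)
for line in text_bar_graph(freq):
    print(line)
print("N =", sum(freq.values()))
```

Each `#` stands for one case, which mirrors the point that bars are “made up of” cases.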
Using a table to display the distribution of two categorical variables
[Table: youth’s demeanor by officer’s disposition. The rows and columns are the values or scores of each variable; each “cell” holds the number of cases at that combination of values/scores]

DEPICTING THE DISTRIBUTION OF CONTINUOUS VARIABLES

Depicting the distribution of continuous variables: the histogram
• Distributions depict the frequency (number of cases) at each value of a variable. Here there is one variable: age, measured on a scale of 20-33.
• Values of the variable are always on the X axis.
• Frequency means the number of cases (here, students) at a single value of a variable. Frequencies are always on the Y axis.
• A case is a single unit that “contains” all the variables of interest. Here each student is a case.
• What is the area under the trend line “made up of”? Cases, meaning students (arranged by age).
[Histogram with trend line: age on the X axis (value or score of variable), number of students on the Y axis (how many at each value/score)]

Sometimes, bar graphs are used for continuous variables
• What are the bars “made of”? Cases, meaning homicides (arranged by the variable homicides per year).
[Bar graph: value or score of variable on the X axis, how many at each value/score on the Y axis]

Continuous variables: What “makes up” the areas under the trend lines?
Cases, that’s what!
• Each murdered youth is one “case.” Variable: # youths murdered each month
• Each violent crime is one “case.” Variable: # crimes per 100,000 population each year
• Each commitment to prison is one “case.” Variable: # commitments to prison, per 100,000 population, each year
[Three graphs with trend lines, one per variable: value or score of variable on the X axis, how many at each value/score on the Y axis]

Summarizing the distribution of CATEGORICAL VARIABLES

Summarizing the distribution of categorical variables using percentage
• Instead of using graphs or a lot of words, is there a single statistic that can convey what a distribution “looks like”?
• Percentage is a “statistic.” It’s a proportion with a denominator of 100.
• Percentages are used to summarize categorical data
  – 70 percent of students are employed; 60 percent of parolees recidivate
• Since per cent means per 100, any decimal can be converted to a percentage by multiplying it by 100 (moving the decimal point two places to the right)
  – .20 = .20 X 100 = 20 percent (twenty per hundred)
  – .368 = .368 X 100 = 36.8 percent (thirty-six point eight per hundred)
• When converting, remember that there can be fractions of one percent
  – .0020 = .0020 X 100 = .20 percent (two tenths of one percent)
• To obtain a percentage for a category, divide the number of cases in the category by the total number of cases in the sample, then multiply the result by 100
  – 50,000 persons were asked whether crime is a serious problem: 32,700 said “yes.” What percentage said “yes”?

Using percentages to compare datasets
• Percentages are “normalized” numbers (e.g., per 100), so they can be used to compare datasets of different size
  – Last year, 10,000 people were polled. Eight thousand said crime is a serious problem.
  – This year, 12,000 people were polled. Nine thousand said crime is a serious problem.
  – Calculate the second percentage and compare it to the first

Practical exercise
Draw two bar graphs, one for each class, depicting proportions for gender

Class 1
• 15 Females: 15/31 = .484 X 100 = 48%
• 16 Males: 16/31 = .516 X 100 = 52%
• Total: 100%

Class 2
• 20 Females: 20/31 = .645 X 100 = 65%
• 11 Males: 11/31 = .355 X 100 = 35%
• Total: 100%

Calculating increases in percentage
Increases in percentage are computed off the base amount
Example: Jail with 120 prisoners. How many prisoners will there be...
– …with a 100 percent increase?
  100 percent of the base amount, 120, is 120 (120 X 100/100)
  120 base + 120 increase = 240 (2 times the base amount)
– …with a 150 percent increase?
  150 percent of 120 is 180 (120 X 150/100)
  120 base + 180 increase = 300 (2½ times the base amount)
– How many will there be with a 200 percent increase?
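The percentage arithmetic on these slides can be sketched in a few lines of Python. This is only an illustration: the function names `percentage` and `after_increase` are made up here, and the inputs are the survey, poll and jail figures from the slides.

```python
def percentage(category_count, total):
    """Percentage of cases in a category: (count / total) X 100."""
    return category_count / total * 100


def after_increase(base, percent_increase):
    """Amount after a percentage increase computed off the base amount."""
    increase = base * percent_increase / 100
    return base + increase


# Survey from the slide: 32,700 of 50,000 persons said "yes".
yes_pct = percentage(32_700, 50_000)

# Two polls of different size, compared on the same per-100 scale.
last_year = percentage(8_000, 10_000)
this_year = percentage(9_000, 12_000)

# Jail with 120 prisoners: 100 percent and 150 percent increases.
plus_100 = after_increase(120, 100)
plus_150 = after_increase(120, 150)

print(round(yes_pct, 1), round(last_year, 1), round(this_year, 1))
print(plus_100, plus_150)
```

Note that `after_increase` adds the increase to the base, so a 100 percent increase doubles the base amount; it does not leave it unchanged.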
[Figure: the original amount compared with an amount 100% larger (2 times the original, 2X) and an amount 200% larger (3 times the original, 3X)]

Percentage changes can mislead
• Answer to preceding slide: jail with 120 prisoners, 200 percent increase
  – 200 percent of 120 is 240 (120 X 200/100)
  – 120 base + 240 increase = 360 (3 times the base amount)
• Percentages can make changes seem large when bases are small
  – Example: Increase from 1 to 3 convictions is 200 (two hundred) percent
  – 3 - 1 = 2; 2/base = 2/1 = 2; 2 X 100 = 200%
• Percentages can make changes seem small when bases are large
  – Example: Increase from 5,000 to 6,000 convictions is 20 (twenty) percent
  – 6,000 - 5,000 = 1,000; 1,000/base = 1,000/5,000 = .20; .20 X 100 = 20%

Summarizing the distribution of CONTINUOUS VARIABLES

Four summary statistics for continuous variables
• Continuous variables – review
  – Can take on an infinite number of values (e.g., age, height, weight, sentence length)
  – Precise differences between cases
  – Equivalent differences: the distance between 15 and 20 years is the same as between 60 and 65 years
• Summary statistics for continuous variables
  – Mean: arithmetic average of scores
  – Median: midpoint of scores (half higher, half lower)
  – Mode: most frequent score (or scores, if tied)
  – Range: difference between low and high scores

Summarizing the distribution of continuous variables - the mean
• Arithmetic average of scores
  – Add up all the scores
  – Divide the result by the number of scores
• Example: Compare numbers of arrests for twenty police precincts during a certain shift
• Method: Use mean to summarize arrests at each precinct, then compare the means
• Variable: number of arrests. Unit of analysis: police precincts. Case: one precinct
[Two graphs of arrests: one with mean 3.0, the other with mean 3.5]
• Issue: Means are pulled in the direction of extreme scores, possibly misleading the comparison

Transforming categorical/ordinal variables into continuous variables, then using the mean
• Ordinal variables are categorical variables with an inherent order
  – Small, medium, large
  – Cooperative, uncooperative
• Can summarize in the ordinary way: proportions / percentages
• Can also transform them into continuous variables by assigning categories points on a scale, then calculating a mean
• Not always recommended because “distances” between points on the scale may not be equal, causing misleading results
• Is the distance between “Admonished” and “Informal” the same as between “Informal” and “Citation”? Between “Citation” and “Arrest”?

Severity of disposition (youths)
Value   Severity of disposition            Freq.     %
4       Arrested                             16     24
3       Citation or official reprimand        9     14
2       Informal reprimand                   16     24
1       Admonished & released                25     38
        Total (N)                            66    100

Severity of disposition mean = [(25 X 1) + (16 X 2) + (9 X 3) + (16 X 4)] / 66 = 148 / 66 = 2.24

Summarizing the distribution of continuous variables - the median
• Median can be used with continuous or ordinal variables
• Median is a useful summary statistic when there are extreme scores, making the mean misleading
• In this example, which is identical to the preceding page except for one outlier (16), the mean is 3.5, which is .5 higher
• But the medians (3.0) are the same
  – arrests (preceding page): 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6; median = (3 + 3) / 2 = 3
  – arrests (with outlier): 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 16; median = (3 + 3) / 2 = 3
• Compute the median:
  – Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
  – Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21
• Answers to preceding slide
  – Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21. Answer: 8
  – Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21. Answer: 10, from (8 + 12) / 2

Summarizing the distribution of continuous variables - the mode
• Score that occurs most often (with the greatest frequency)
• Here the mode is 3
• Modes are a useful summary statistic when cases cluster at particular scores, an interesting condition that might otherwise be overlooked
• Symmetrical, bell-shaped distributions, like this one, are called “normal” distributions. In such distributions the mean, mode and median are the same. Near-normal distributions are common.
• There can be more than one mode (bi-modal, tri-modal, etc.).
Identify the modes:
• Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21
• Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21

A final way to depict the distribution of continuous variables - the range
• Answers to preceding slide
  – Exercise 1: 2, 3, 5, 5, 8, 12, 17, 19, 21. Mode = 5 (unimodal)
  – Exercise 2: 2, 3, 5, 5, 8, 12, 17, 19, 21, 21. Modes = 5, 21 (bimodal)
• Range: a simple way to convey the distribution of a continuous variable
  – Depicts the lowest and highest scores in a distribution
  – For 2, 3, 5, 5, 8, 12, 17, 19, 21 the range is “2 to 21”
  – Range can also be defined as the difference between the highest and lowest scores (21 - 2 = 19). If so, minimum and maximum scores should also be given.
  – Useful to cite range if there are outliers (extreme scores) that misleadingly distort the shape of the distribution

Practical exercise
• Calculate your class summary statistics for age and height: mean, median, mode and range
• Pictorially depict the distributions for age and height, placing the variables and frequencies on the correct axes

Next week and every week: Without fail, bring an approved calculator, the same one you will use for the exam. It must be a basic calculator with a square root key. NOT a scientific or graphing calculator. NOT a cell phone, etc.

HOMEWORK (link on weekly schedule)
1. Calculate all appropriate summary statistics for each distribution
2. Pictorially depict the distribution of arrests
3. Pictorially depict the distribution of gender

Case No.   Income   No. of arrests   Gender
1          15600    4                M
2          21380    3                F
3          17220    5                F
4          18765    2                M
5          23220    1                F
6          44500    0                M
7          34255    0                F
8          21620    0                F
9          14890    1                M
10         16650    2                F
11         44500    1                F
12         16730    3                M
13         23980    3                F
14         14005    0                F
15         21550    2                M
16         26780    4                M
17         18050    1                F
18         34500    1                M
19         33785    3                F
20         21450    2                F
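For checking hand calculations, the four summary statistics covered in these slides (mean, median, mode and range) can be sketched with Python's standard `statistics` module. The helper name `summarize` is made up for this illustration; the scores are the Exercise 1, Exercise 2 and precinct arrest lists from the slides, not the homework data.

```python
from statistics import mean, median, multimode


def summarize(scores):
    """Return mean, median, mode(s) and range for a list of scores."""
    low, high = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "median": median(scores),
        "modes": multimode(scores),  # every score tied for most frequent
        "range": (low, high),        # lowest and highest scores
        "range_width": high - low,   # difference between them
    }


exercise_1 = [2, 3, 5, 5, 8, 12, 17, 19, 21]
exercise_2 = [2, 3, 5, 5, 8, 12, 17, 19, 21, 21]

# Precinct arrests from the slides: identical except for the last score.
arrests = [0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6]
arrests_outlier = arrests[:-1] + [16]

for name, scores in [
    ("Exercise 1", exercise_1),
    ("Exercise 2", exercise_2),
    ("Arrests", arrests),
    ("Arrests with outlier", arrests_outlier),
]:
    print(name, summarize(scores))
```

Comparing the last two summaries shows the point made about extreme scores: the single outlier raises the mean while the median stays put.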