HSS2381A – Quantitative Methods
Univariate Analysis, Part 2: Frequency Analysis

WHAT THE HECK ARE ALL THOSE NUMBERS???

Frequency Distributions
• That’s what a frequency distribution is for: to help impose order on the data
• A frequency distribution is a systematic arrangement of data values, with a count of how many times each value occurred in a dataset

Uses of Frequency Distributions in Data Analysis
• First step in understanding your data!
– Begin by looking at the frequency distributions for all or most variables, to “get a feel” for the data
– Through inspection of frequency distributions, you can begin to assess how “clean” the data are

Data Cleaning
• One aspect of data cleaning involves seeing whether the frequency distribution contains:
– Outliers: Values that lie outside the normal range of values, and that may or may not be legitimate
– Wild codes: Impossible or invalid codes, like a code of “3” for the variable sex when valid codes are 1 (female) and 2 (male)

Wild Codes
Codes for Sex | Frequency | Percent
1 (Female) | 49 | 49.0%
2 (Male) | 47 | 47.0%
3 | 1 | 1.0%
7 | 2 | 2.0%
Total | 100 | 100.0%
The codes 3 and 7 are WILD!

Missing Values
• Frequency distributions can help you assess the pervasiveness of a thorny problem in data analysis:
– Missing data

Wanted: Missing Number!
Description: Data Values in Important Study
Last seen: Date of Enrollment
Missing from: My Dataset
If Found: Contact Me!
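
The wild-code check from the Data Cleaning slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `responses` list is made-up data (not the dataset in the table above), and the function names `frequency_distribution` and `wild_codes` are hypothetical helpers, not part of SPSS or any library.

```python
from collections import Counter

# Valid codes for the variable "sex", as given on the slide.
VALID_CODES = {1: "Female", 2: "Male"}

# Made-up coded responses, for illustration only.
responses = [1, 2, 2, 1, 3, 1, 2, 7, 1, 2]


def frequency_distribution(values):
    """Return {value: (count, percent)}, ordered by value."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, 100.0 * c / n) for v, c in sorted(counts.items())}


def wild_codes(values, valid):
    """Return the sorted codes that are not among the valid codes."""
    return sorted(set(values) - set(valid))


freq = frequency_distribution(responses)
for code, (count, pct) in freq.items():
    label = VALID_CODES.get(code, "WILD")
    print(f"{code} ({label}): frequency = {count}, percent = {pct:.1f}%")

print("Wild codes:", wild_codes(responses, VALID_CODES))
```

Any code flagged as wild should be checked against the original data source before it is corrected or set to missing.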

Inspection for Missing Values
Sex | Frequency | Percent | Valid %
1 (Female) | 46 | 46.0 | 51.7
2 (Male) | 43 | 43.0 | 48.3
7 (Refused) | 11 | 11.0 | --
Total | 100 | 100.0 | 100.0
11.0% of the data are missing because participants refused to report their sex. Valid % is based on the 89 non-missing cases only (e.g., 46 ÷ 89 = 51.7%).

Assumptions
• Frequency distributions can help you assess the validity of certain assumptions for many statistical tests
– An assumption is a condition presumed to be true; when it is violated, the results can be invalid
– For many inferential statistics, a normal distribution (for the dependent variable) is assumed

Describe Sample
• Frequency distributions can help you better understand the type of people who are in your study sample:
– What percent are men?
– What percent are African American?
– What percent have a college degree?

Answer Descriptive Questions
• Frequency distributions can sometimes be used to answer descriptive research questions
• BUT…inferential statistics are almost always needed, because they allow you to draw inferences about a broader group than the study sample

Frequency Distributions in SPSS
• Use the Analyze → Descriptive Statistics → Frequencies command
• Click “Analyze” in the top toolbar menu, which brings up a pop-up menu; select Descriptive Statistics → Frequencies

Frequencies Command in SPSS
• All variables in the dataset are listed in the box on the left
• Use the arrow to move the desired variable into the slot marked “Variable(s)”
• Pushbuttons provide various options

Frequencies: Statistics Options in SPSS
• Many options are available within Frequencies: Statistics
• For example, you can select statistics for skewness and kurtosis

Frequencies: Chart Options in SPSS
• The Charts option allows you to create bar charts, pie charts, and histograms
• Normal curve superimposed: An option for histograms
• Chart values can be frequencies or percentages (not available for histograms)

Graphs in SPSS
• An even wider array of graphs can be created using the Graphs menu on the main toolbar

Characteristics of a Data Distribution
• Shape (Chapter 2)
• Central tendency
• Variability
– Both central tendency and variability can be expressed by indexes that are descriptive statistics

Central Tendency
• Indexes of central tendency provide a single number to characterize a distribution
• Measures of central tendency come from the center of the distribution of data values, indicating what is “typical,” and where data values tend to cluster
• Popularly called an “average”

Central Tendency Indexes
• Three alternative indexes:
– The mode
– The median
– The mean

The Mode
• The mode is the score value with the highest frequency; the most “popular” score
– Age: 26 27 27 28 29 30 31
– Mode = 27
[Figure: histogram of AGE (N = 7) with the mode marked]

The Mode: Advantages
• Can be used with data measured on any measurement level (including nominal level)
• Easy to “compute”
• Reflects an actual value in the distribution, so it is easy to understand
• Useful when there are 2+ “popular” scores (i.e., in multimodal distributions)

The Mode: Disadvantages
• Ignores most information in the distribution
• Tends to be unstable (i.e., value varies a lot from one sample to the next)
• Some distributions may not have a mode (e.g., 10, 10, 11, 11, 12, 12)

The Median
• The median is the score that divides the distribution into two equal halves
• 50% are below the median, 50% above
– Age: 26 27 27 28 29 30 31
– Median (Mdn) = 28
[Figure: histogram of AGE (N = 7) with the median marked]

The Median: Advantages
• Not influenced by outliers
• Particularly good index of what is “typical” when distribution is skewed
• Easy to “compute”
• Appropriate when data are ordinal level

The Median: Disadvantages
• Does not take actual data values into account; it is only an index of position
• Value of median not necessarily an actual data value, so it is more difficult to understand than the mode

The Mean
• The mean is the arithmetic average
• Data values are summed and divided by N
– Age: 26 27 27 28 29 30 31
– Mean = 28.3
[Figure: histogram of AGE (N = 7) with the mean marked]

The Mean (cont’d)
• Most frequently used measure of central tendency; usually preferred for interval- and ratio-level data
• Equation: M = ΣX ÷ N
• Where:
– M = sample mean
– Σ = the sum of
– X = actual data values
– N = number of people

The Mean: Advantages
• The balance point in the distribution:
– Sum of deviations above the mean always exactly balances those below it
• Does not ignore any information
• The most stable index of central tendency
• Many inferential statistics are based on the mean

The Mean: Disadvantages
• Sensitive to outliers
• Gives a distorted view of what is “typical” when data are skewed
• Value of mean is often not an actual data value

The Mean: Symbols
• Sample means:
– In reports, usually symbolized as M
– In statistical formulas, usually symbolized as x̄ (pronounced “X bar”)
• Population means:
– The Greek letter μ (mu)

Central Tendency in Normal Distributions
• In a normal distribution, all three indexes coincide

Central Tendency in Skewed Distributions
• In a skewed distribution, the mean is pulled “off center” in the direction of the skew

Variability
• Variability concerns how spread out or dispersed data values in a distribution are
• Two distributions with the same mean could have different dispersion

Variability (cont’d)
• High variability: A heterogeneous distribution (A)
• Low variability: A homogeneous distribution (B)

Indexes of Variability
• Range
• Interquartile range
• Standard deviation
• Variance

The Range
• Range: The difference between the highest and lowest value in the distribution
• Weights (pounds): 110 120 130 140 150 150 160 170 180 190
• The range here is 80 (190 – 110)

The Range: Advantages
• Easy to compute
• Readily understood
• Communicates information of interest to readers of a report

The Range: Disadvantages
• Depends on only two scores, so it does not take all information into account
• Sensitive to outliers
• Tends to be unstable: fluctuates from sample to sample
• Influenced by sample size

The Interquartile Range
• Interquartile range (IQR): Based on quartiles
– Lower quartile (Q1): Point below which 25% of scores lie
– Upper quartile (Q3): Point below which 75% of scores lie
• IQR = Q3 – Q1
– IQR is the range of scores within which the middle 50% of scores lie

Consider this dataset (yanked from Wikipedia)
[Table: the 11 ordered data values]
Notice that Q2 is always the median
N = 11, so n + 1 = 12
Q2 = median = entry # (n+1)/2
Q1 = lower quartile = entry # (n+1)/4
Q3 = upper quartile = entry # 3(n+1)/4
Q1 = 3rd entry = 105
Q3 = 9th entry = 115
IQR = Q3 – Q1 = 115 – 105 = 10

The Interquartile Range (cont’d)
• Another example: Weights (pounds): 110 120 130 140 150 160 170 180 190
• The IQR is 50.0 (175 – 125)
• Let’s see how we get that…

Number of entry | Value
1 | 110
2 | 120
3 | 130
4 | 140
5 | 150
6 | 160
7 | 170
8 | 180
9 | 190

Step 1: Where is the median?
Q2 = median = entry # (9+1)/2 = 5th entry = 150
Q1 will be entry # (9+1)/4 = 2.5 = halfway between 120 and 130, so Q1 = 125
Q3 will be entry # 3(9+1)/4 = 7.5 = halfway between 170 and 180, so Q3 = 175

What if we have an even number of values?
• IQR example: Weights (pounds): 110 120 130 140 150 150 160 170 180 190
• The IQR is 45.0 (172.5 – 127.5)
• Let’s see how we get that…

Number of entry | Value
1 | 110
2 | 120
3 | 130
4 | 140
5 | 150
6 | 150
7 | 160
8 | 170
9 | 180
10 | 190

Step 1: Where is the median?
Q2 = median = 150
Q1 will be entry # (10+1)/4 = 2.75 = ¾ of the way between 120 and 130
Or... 120 + [(130 – 120) × 0.75] = 127.5
Q3 will be entry # 3(10+1)/4 = 8.25 = ¼ of the way between 170 and 180
Or... 170 + [(180 – 170) × 0.25] = 172.5
IQR = Q3 – Q1 = 172.5 – 127.5 = 45.0

If you want to check your work, use any stats software, or an online IQR calculator, such as: http://www.alcula.com/calculators/statistics/interquartile-range/

The Interquartile Range: Advantages
• Reduces influence of outliers and extreme scores in expressing variability
• Uses more information than the range
• Important in evaluating outliers
• Appropriate as index of variability with ordinal measures

The Interquartile Range: Advantages (cont’d)
• The closer the clustering of values around the median, the smaller the interquartile range
• A small IQR shows clustering around the median. Why is this useful?

The Interquartile Range: Disadvantages
• Is not particularly easy to compute
• Is not well understood
• Does not take all values into account

The Standard Deviation
• Standard deviation (SD): An index that conveys how much, on average, scores in a distribution vary
• SDs are based on deviation scores (x), calculated by subtracting the mean from each person’s original score: x = X – M

Standard Deviation Interpretation
• In a normal distribution, a fixed percentage of cases lie within certain distances from the mean: about 68% within 1 SD, about 95% within 2 SDs, and about 99.7% within 3 SDs
• We will do more with SD and variance...
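
The quartile rule used in the IQR slides (entry numbers (n+1)/4, (n+1)/2 and 3(n+1)/4, with interpolation between neighbouring entries) can be sketched in Python as a way to check hand calculations. The function names `value_at_entry`, `quartiles` and `iqr` are hypothetical, and the sketch assumes at least three data values.

```python
def value_at_entry(sorted_values, entry_number):
    """Value at a 1-based, possibly fractional entry number.

    Entry 2.75 means three quarters of the way from entry 2 to entry 3.
    """
    whole = int(entry_number)            # entry just below the position
    fraction = entry_number - whole      # how far toward the next entry
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return float(lower)
    upper = sorted_values[whole]
    return lower + (upper - lower) * fraction


def quartiles(values):
    """Return (Q1, Q2, Q3) using the (n+1)-based entry numbers from the slides."""
    data = sorted(values)
    n = len(data)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 values")
    q1 = value_at_entry(data, (n + 1) / 4)
    q2 = value_at_entry(data, (n + 1) / 2)
    q3 = value_at_entry(data, 3 * (n + 1) / 4)
    return q1, q2, q3


def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


odd_weights = [110, 120, 130, 140, 150, 160, 170, 180, 190]
even_weights = [110, 120, 130, 140, 150, 150, 160, 170, 180, 190]

print(quartiles(odd_weights), iqr(odd_weights))
print(quartiles(even_weights), iqr(even_weights))
```

Other software may use a different quartile convention, so its quartiles can differ slightly from the ones computed by hand with this rule.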

Measurement Scales and Descriptive Statistics
Scale | Central Tendency Index | Variability Index
Nominal | Mode | --
Ordinal | Median | Range, IQR
Interval and ratio | Mean | Standard deviation, Variance

Uses of Descriptive Statistics
• Indexes of central tendency and variability are used to:
– Understand data, get a “big picture”
– Evaluate outliers and the need for strategies to address problems (e.g., using a trimmed mean, which recalculates the mean after deleting a fixed percentage of cases, such as 5%, from either end)
– Describe research participants (e.g., their age, education, length of illness)
– Answer descriptive questions

Descriptive Statistics in SPSS
• Can be obtained through Analyze → Descriptive Statistics, in three programs within that broad umbrella (each has slightly different options):
– Frequencies → Statistics
– Descriptives → Options
– Explore → Statistics

Descriptive Statistics in SPSS: Frequencies
• Percentile values
• Central tendency
• Dispersion (variability)
• Skewness and kurtosis

Descriptive Statistics in SPSS: Descriptives
• Mean (no median)
• Dispersion (variability)
• Skewness and kurtosis
• No percentiles
• BUT has good display options

Example
• We ask a class of 10 students what their weight in pounds is. We get:

Student | Weight
1 | 98
2 | 102
3 | 175
4 | 165
5 | 160
6 | 148
7 | 320
8 | 102
9 | 111
10 | 55

Step 1: Rank the data

Student | Weight
10 | 55
1 | 98
2 | 102
8 | 102
9 | 111
6 | 148
5 | 160
4 | 165
3 | 175
7 | 320
Total = 1436

Mean = total/number of students = 1436/10 = 143.6
Mode = most common response = 102

How do we find the median?
Find the middle value. Since there are 10 values in total, there are 2 middle values (111 and 148). Find the midpoint between the two by computing their mean: (111 + 148)/2 = 129.5

How do we find the range?
Find the maximum: 320
Find the minimum: 55
Subtract them: 320 – 55 = 265

How do we find the IQR?
Q1 = entry # (10+1)/4 = 2.75, so 98 + [(102 – 98) × 0.75] = 101.0
Q2 = median = 129.5
Q3 = entry # 3(10+1)/4 = 8.25, so 165 + [(175 – 165) × 0.25] = 167.5
IQR = Q3 – Q1 = 167.5 – 101.0 = 66.5

Homework
• P. 57 A1, A2, A3
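
The class-weights example can also be checked with Python's standard `statistics` module. This is a sketch for checking hand calculations, not a replacement for SPSS output. It assumes Python 3.8 or later, and it assumes that the default ("exclusive") method of `statistics.quantiles` follows the same (n+1)-based rule as the slides; it does reproduce the quartiles worked out above for this dataset.

```python
import statistics

# Weights (pounds) for students 1 to 10, from the example.
weights = [98, 102, 175, 165, 160, 148, 320, 102, 111, 55]

ranked = sorted(weights)                    # Step 1: rank the data
mean = statistics.mean(weights)             # total / number of students
mode = statistics.mode(weights)             # most common response
median = statistics.median(weights)         # midpoint of the two middle values
value_range = max(weights) - min(weights)   # maximum minus minimum

# Quartiles; the default method is assumed to match the slides' rule.
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1

print("Ranked:", ranked)
print("Mean:", mean, "Mode:", mode, "Median:", median)
print("Range:", value_range)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Note how the single extreme weight (320) pulls the mean (143.6) well above the median (129.5), which is the sensitivity to outliers described in The Mean: Disadvantages.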