STATISTICS for MANAGERS
Fellowship Course on Health System Management
A Keshtkar MD, MPH, PhD, Assistant Professor of Epidemiology

Why a Manager Needs to Know about Statistics
To know how to:
  properly present information
  draw conclusions about populations based on sample information
  improve processes
  obtain reliable forecasts

Key Definitions
  A population (universe) is the collection of all items or things under consideration
  A sample is a portion of the population selected for analysis
  A parameter is a summary measure that describes a characteristic of the population

Population vs. Sample
  [Figure: a population of items a through z and a sample drawn from it (b, c, g, i, n, o, r, u, y)]
  Measures used to describe the population are called parameters (for example: the population mean)
  Measures computed from sample data are called statistics (for example: the sample mean)

Two Branches of Statistics
  Descriptive statistics: collecting, summarizing, and describing data
  Inferential statistics: drawing conclusions and/or making decisions concerning a population based only on sample data

Descriptive Statistics
3 major functions:
  Collect data, e.g., survey
  Present data, e.g., tables and graphs
  Characterize data, e.g., sample mean $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

Inferential Statistics
2 major functions:
  Estimation, e.g., estimate the population mean weight using the sample mean weight
  Hypothesis testing, e.g., test the claim that the population mean weight is 120 pounds
Inference means drawing conclusions and/or making decisions concerning a population based on sample results.
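To make the distinction between a parameter and a statistic concrete, here is a minimal Python sketch. The population (26 weights in pounds) is invented purely for illustration and does not come from the course material; the fixed seed only makes the random draw repeatable.

```python
import random
from statistics import mean

# Hypothetical population: 26 weights in pounds (invented for illustration).
population = list(range(100, 152, 2))

random.seed(1)                          # assumption: fixed seed for a repeatable draw
sample = random.sample(population, 9)   # simple random sample without replacement

mu = mean(population)   # parameter: describes the whole population
x_bar = mean(sample)    # statistic: computed from the sample, used to estimate mu

print(f"population mean (parameter) = {mu}")
print(f"sample mean (statistic)     = {x_bar:.2f}")
```

A different seed gives a different sample mean, while the population mean stays fixed; that sample-to-sample variation is what inferential statistics has to account for.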
Data Sources
  Primary (data collection): observation, survey, experimentation
  Secondary (data compilation): print or electronic

Reasons for Drawing a Sample
  Less time consuming than a census
  Less costly to administer than a census
  Less cumbersome and more practical to administer than a census of the targeted population

Types of Sampling Methods
  Non-probability sampling: items included are chosen without regard to their probability of occurrence
  Probability sampling: items in the sample are chosen on the basis of known probabilities

Types of Samples Used
  Non-probability samples: judgement, quota, chunk, convenience
  Probability samples: simple random, systematic, stratified, cluster

Simple Random Samples
  Every individual or item from the frame has an equal chance of being selected
  Selection may be with replacement or without replacement
  Samples are obtained from a table of random numbers or from computer random number generators

Systematic Samples
  Decide on sample size: n
  Divide the frame of N individuals into groups of k individuals: k = N/n
  Randomly select one individual from the 1st group
  Select every kth individual thereafter
  Example: N = 64, n = 8, k = 8

Stratified Samples
  Divide the population into two or more subgroups (called strata) according to some common characteristic
  A simple random sample is selected from each subgroup, with sample sizes proportional to strata sizes
  Samples from the subgroups are combined into one
  [Figure: population divided into 4 strata, and the resulting sample]

Cluster Samples
  The population is divided into several “clusters,” each representative of the population
  A simple random sample of clusters is selected
  All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique
  Figure: population divided into 16 clusters.
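The systematic sampling steps listed above (k = N/n, a random start in the first group, then every kth individual) can be sketched in Python as follows. The function name `systematic_sample` and the frame of individuals numbered 1 to 64 are illustrative assumptions; the sketch also assumes n divides N evenly, as in the N = 64, n = 8 example.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Return a systematic sample of size n from the list `frame` (hypothetical helper)."""
    N = len(frame)
    k = N // n                  # group size k = N/n (assumes n divides N evenly)
    start = rng.randrange(k)    # random position within the first group of k individuals
    return [frame[start + i * k] for i in range(n)]   # every kth individual thereafter

frame = list(range(1, 65))      # hypothetical frame: individuals numbered 1..64
sample = systematic_sample(frame, n=8)
print(sample)
```

Only the starting point is random; once it is chosen, the remaining seven selections are fully determined.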
Figure (cluster samples): randomly selected clusters for sample.

Advantages and Disadvantages
  Simple random sample and systematic sample
    Simple to use
    May not be a good representation of the population’s underlying characteristics
  Stratified sample
    Ensures representation of individuals across the entire population
  Cluster sample
    More cost effective
    Less efficient (need a larger sample to acquire the same level of precision)

Types of Data
  Categorical (defined categories). Examples: marital status, political party, eye color
  Numerical
    Discrete (counted items). Examples: number of children, defects per hour
    Continuous (measured characteristics). Examples: weight, voltage

Levels of Measurement and Measurement Scales
  Ratio data: differences between measurements, true zero exists
  Interval data: differences between measurements but no true zero
    (ratio and interval data: highest level, strongest forms of measurement)
  Ordinal data: ordered categories (rankings, order, or scaling); higher level
  Nominal data: categories (no ordering or direction); lowest level, weakest form of measurement

Definition of SURVEY
A “survey” is a study type that usually has two characteristics:
  1. Representativeness is an important goal
  2. The data collection tool is a questionnaire, and the data collection method is an interview (questioning and answering)

Evaluating Survey Worthiness
  What is the purpose of the survey?
  Is the survey based on a probability sample?
  Coverage error: appropriate frame?
  Non-response error: follow up
  Measurement error: good questions elicit good responses
  Sampling error: always exists

Types of Survey Errors
  Coverage error or selection bias
    Exists if some groups are excluded from the frame and have no chance of being selected
  Non-response error or bias
    People who do not respond may be different from those who do respond
  Sampling error
    Variation from sample to sample will always exist
  Measurement error
    Due to weaknesses in question design, respondent error, and the interviewer’s effects on the respondent

Types of Survey Errors (continued)
  Coverage error: excluded from frame
  Non-response error: follow up on nonresponses
  Sampling error: random differences from sample to sample
  Measurement error: bad or leading question

Organizing and Presenting Data Graphically
  Data in raw form are usually not easy to use for decision making
  Some type of organization is needed: table or graph
  Techniques reviewed here:
    Frequency distributions and histograms
    Bar charts and pie charts
    Contingency tables

Tables and Charts for Numerical Data
  Continuous data: frequency distributions and cumulative distributions; histogram, polygon, box plot
  Discrete data: line or polygon

Tabulating Numerical Data: Frequency Distributions
What is a Frequency Distribution?
  A frequency distribution is a list or a table containing class groupings (categories or ranges within which the data fall) and the corresponding frequencies with which data fall within each grouping or category

Why Use Frequency Distributions?
  A frequency distribution is a way to summarize data
  The distribution condenses the raw data into a more useful form...
  and allows for a quick visual interpretation of the data

Class Intervals and Class Boundaries
  Each class grouping has the same width
  Determine the width of each interval by
    $\text{width of interval} = \frac{\text{range}}{\text{number of desired class groupings}}$
  Use at least 5 but no more than 15 groupings
  Class boundaries never overlap
  Round up the interval width to get desirable endpoints

Frequency Distribution Example
Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:
  24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Frequency Distribution Example (continued)
  Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
  Find range: 58 - 12 = 46
  Select number of classes: 5 (usually between 5 and 15)
  Compute class interval (width): 10 (46/5 = 9.2, then round up)
  Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
  Compute class midpoints: 15, 25, 35, 45, 55
  Count observations and assign to classes

Frequency Distribution Example (continued)
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

  Class                  Frequency   Relative Frequency   Percentage
  10 but less than 20        3             .15                15
  20 but less than 30        6             .30                30
  30 but less than 40        5             .25                25
  40 but less than 50        4             .20                20
  50 but less than 60        2             .10                10
  Total                     20            1.00               100

Graphing Numerical Data: The Histogram
  A graph of the data in a frequency distribution is called a histogram
  The class boundaries (or class midpoints) are shown on the horizontal axis
  The vertical axis is either frequency, relative frequency, or percentage
  Bars of the appropriate heights are used to represent the number of observations within each class

Histogram Example

  Class                  Class Midpoint   Frequency
  10 but less than 20         15              3
  20 but less than 30         25              6
  30 but less than 40         35              5
  40 but less than 50         45              4
  50 but less than 60         55              2

Figure: Histogram: Daily High Temperature (vertical axis: Frequency, 0 to 7; no gaps between bars)
(Histogram horizontal axis: Class Midpoints 5, 15, 25, 35, 45, 55, More)

Histograms in Excel
  1. Select Tools/Data Analysis
  2. Choose Histogram
  3. Input data range and bin range (the bin range is a cell range containing the upper class boundaries for each class grouping), select Chart Output, and click “OK”

Questions for Grouping Data into Classes
  1. How wide should each interval be? (How many classes should be used?)
  2. How should the endpoints of the intervals be determined?
  Often answered by trial and error, subject to user judgment
  The goal is to create a distribution that is neither too "jagged" nor too "blocky”
  The goal is to appropriately show the pattern of variation in the data

How Many Class Intervals?
  Many (narrow class intervals)
    may yield a very jagged distribution with gaps from empty classes
    can give a poor indication of how frequency varies across classes
  Few (wide class intervals)
    may compress variation too much and yield a blocky distribution
    can obscure important patterns of variation
  [Figure: two histograms of Temperature against Frequency, one with many narrow classes and one with few wide classes; X axis labels are upper class endpoints]

Graphing Numerical Data: The Frequency Polygon

  Class                  Class Midpoint   Frequency
  10 but less than 20         15              3
  20 but less than 30         25              6
  30 but less than 40         35              5
  40 but less than 50         45              4
  50 but less than 60         55              2

  [Figure: Frequency Polygon: Daily High Temperature; vertical axis: Frequency, 0 to 7; horizontal axis: Class Midpoints 5, 15, 25, 35, 45, 55, More]
  (In a percentage polygon the vertical axis would be defined to show the percentage of observations per class)

Tabulating Numerical Data: Cumulative Frequency
Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

  Class                  Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
  10 but less than 20        3           15                3                     15
  20 but less than 30        6           30                9                     45
  30 but less than 40        5           25               14                     70
  40 but less than 50        4           20               18                     90
  50 but less than 60        2           10               20                    100
  Total                     20          100
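The frequency and cumulative frequency tables above can be reproduced with a short Python sketch. It uses the 20 daily high temperatures and the class boundaries 10, 20, ..., 60 from the example; the variable names are illustrative choices, not part of the course material.

```python
# Daily high temperatures from the example (20 winter days).
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]
boundaries = [10, 20, 30, 40, 50, 60]          # class boundaries from the example
classes = list(zip(boundaries, boundaries[1:]))  # (lower, upper) pairs

n = len(temps)
# Each class is "lower but less than upper", so boundaries never overlap.
freq = [sum(lower <= t < upper for t in temps) for lower, upper in classes]
pct = [100 * f / n for f in freq]

# Running totals give the cumulative frequency and cumulative percentage.
cum_freq, cum_pct, running = [], [], 0
for f in freq:
    running += f
    cum_freq.append(running)
    cum_pct.append(100 * running / n)

for (lower, upper), f, p, cf, cp in zip(classes, freq, pct, cum_freq, cum_pct):
    print(f"{lower} but less than {upper}: freq={f:2d} pct={p:5.1f} "
          f"cum_freq={cf:2d} cum_pct={cp:5.1f}")
```

The cumulative percentages computed here are the values plotted against the class boundaries in an ogive.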
Graphing Cumulative Frequencies: The Ogive (Cumulative % Polygon)

  Class                  Upper class boundary   Cumulative Percentage
  Less than 10                   10                      0
  10 but less than 20            20                     15
  20 but less than 30            30                     45
  30 but less than 40            40                     70
  40 but less than 50            50                     90
  50 but less than 60            60                    100

  [Figure: Ogive: Daily High Temperature; vertical axis: Cumulative Percentage, 0 to 100; horizontal axis: Class Boundaries (not midpoints), 10 to 60]

Scatter Diagrams
  Scatter diagrams are used for bivariate numerical data
  Bivariate data consist of paired observations taken from two numerical variables
  The scatter diagram: one variable is measured on the vertical axis and the other variable is measured on the horizontal axis

Scatter Diagram Example

  Volume per day   Cost per day
       23              125
       26              140
       29              146
       33              160
       38              167
       42              170
       50              188
       55              195
       60              200

  [Figure: Cost per Day vs. Production Volume; vertical axis: Cost per Day, 0 to 250; horizontal axis: Volume per Day, 0 to 70]

Scatter Diagrams in Excel
  1. Select the chart wizard
  2. Select the XY (Scatter) option, then click “Next”
  3. When prompted, enter the data range, desired legend, and desired destination to complete the scatter diagram

Tables and Charts for Categorical Data
  Tabulating data: summary table
  Graphing data: bar charts, pie charts, Pareto diagram

The Summary Table
Summarize data by category
Example: Current Investment Portfolio (variables are categorical)

  Investment Type   Amount (in thousands $)   Percentage (%)
  Stocks                   46.5                   42.27
  Bonds                    32.0                   29.09
  CD                       15.5                   14.09
  Savings                  16.0                   14.55
  Total                   110.0                  100.0

Bar and Pie Charts
  Bar charts and pie charts are often used for qualitative (category) data
  Height of bar or size of pie slice shows the frequency or percentage for each category

Bar Chart Example
  Uses the Current Investment Portfolio summary table above
  [Figure: Investor's Portfolio; horizontal bars for Savings, CD, Bonds, Stocks; horizontal axis: Amount in $1000's, 0 to 50]

Pie Chart Example
Current Investment Portfolio, by Investment Type:
Amount (in thousands $) and Percentage (%):

  Stocks     46.5    42.27
  Bonds      32.0    29.09
  CD         15.5    14.09
  Savings    16.0    14.55
  Total     110.0   100.0

  [Figure: pie chart with slices Stocks 42%, Bonds 29%, Savings 15%, CD 14%]
  Percentages are rounded to the nearest percent

Pareto Diagram
  Used to portray categorical data
  A bar chart, where categories are shown in descending order of frequency
  A cumulative polygon is often shown in the same graph
  Used to separate the “vital few” from the “trivial many”

Pareto Diagram Example
  [Figure: Current Investment Portfolio; bar graph: % invested in each category (Stocks, Bonds, Savings, CD), left axis 0% to 45%; line graph: cumulative % invested, right axis 0% to 100%]

Tabulating and Graphing Multivariate Categorical Data
Contingency Table for Investment Choices ($1000’s)

  Investment Category   Investor A   Investor B   Investor C   Total
  Stocks                   46.5          55          27.5       129
  Bonds                    32.0          44          19.0        95
  CD                       15.5          20          13.5        49
  Savings                  16.0          28           7.0        51
  Total                   110.0         147          67.0       324

  (Individual values could also be expressed as percentages of the overall total, percentages of the row totals, or percentages of the column totals)

Tabulating and Graphing Multivariate Categorical Data (continued)
  Side-by-side bar charts
  [Figure: Comparing Investors; bars for Investor A, Investor B, and Investor C within each category (Savings, CD, Bonds, Stocks); horizontal axis 0 to 60]

Side-by-Side Chart Example
Sales by quarter for three sales territories:

           1st Qtr   2nd Qtr   3rd Qtr   4th Qtr
  East      20.4      27.4      59        20.4
  West      30.6      38.6      34.6      31.6
  North     45.9      46.9      45        43.9

  [Figure: side-by-side bars for East, West, and North in each quarter; vertical axis 0 to 60]

Principles of Graphical Excellence
  Present data in a way that provides substance, statistics and design
  Communicate complex ideas with clarity, precision and efficiency
  Give the largest number of ideas in the most efficient manner
  Excellence almost always involves several dimensions
  Tell the truth about the data

Errors in Presenting Data
  Using “chart junk”
  Failing to provide a relative basis in comparing data between groups
  Compressing or distorting the vertical axis
  Providing no zero point on the vertical axis

Chart Junk
  [Figure: Minimum Wage. Bad presentation: decorated chart labeled 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80. Good presentation: plain plot of $ (0 to 4) by year, 1960 to 1990]

No Relative Basis
  [Figure: A’s received by students, by class. Bad presentation: frequency (0 to 300). Good presentation: percentage (0% to 30%)]
  FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior

Compressing Vertical Axis
  [Figure: Quarterly Sales, Q1 to Q4. Bad presentation: $ axis from 0 to 200. Good presentation: $ axis from 0 to 50]

No Zero Point on Vertical Axis
  [Figure: Monthly Sales, graphing the first six months of sales (J, F, M, A, M, J). Bad presentation: $ axis from 36 to 45 with no zero point. Good presentations: $ axis with a zero point (0, 36, 39, 42, 45), or $ axis from 0 to 60]

Different Measures for Describing Data
  Measures of central tendency, variation, and shape
    Mean, median, mode, geometric mean
    Quartiles
    Range, interquartile range (IQR), variance and standard deviation, coefficient of variation (CV)
    Symmetric and skewed distributions
  Population summary measures
    Mean, variance, and standard deviation
  Normal distribution versus non-normal distribution
    The empirical (normal distribution) rule and the Chebyshev rule

Summary Measures
Describing data numerically:
  Central tendency: arithmetic mean, median, mode, geometric mean
  Quartiles
  Variation: range, interquartile range, variance, standard deviation, coefficient of variation
  Shape: skewness

Measures of Central Tendency
Overview:
  Arithmetic mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
  Median: midpoint of ranked values
  Mode: most frequently observed value
  Geometric mean: $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$

Arithmetic Mean
  The arithmetic mean (mean) is the most common measure of central tendency
  For a sample of size n:
    $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
  where n is the sample size and the $X_i$ are the observed values

Arithmetic Mean (continued)
  The most common measure of central tendency
  Mean = sum of values divided by the number of values
  Affected by extreme values (outliers)
  Examples (values shown on a number line from 0 to 10):
    Values 1, 2, 3, 4, 5: $\frac{1+2+3+4+5}{5} = \frac{15}{5} = 3$, so Mean = 3
    Values 1, 2, 3, 4, 10: $\frac{1+2+3+4+10}{5} = \frac{20}{5} = 4$, so Mean = 4

Median
  In an ordered array, the median is the “middle” number (50% above, 50% below)
  [Figure: two number lines from 0 to 10, each with Median = 3]
  Not affected by extreme values

Finding the Median
  The location of the median:
    Median position = $\frac{n+1}{2}$ position in the ordered data
  If the number of values is odd, the median is the middle number
  If the number of values is even, the median is the average of the two middle numbers
  Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data

Mode
  A measure of central tendency
  Value that occurs most often
  Not affected by extreme values
  Used for either numerical or categorical data
  There may be no mode
  There may be several modes
  [Figure: a number line from 0 to 14 with Mode = 9, and a number line from 0 to 6 with No Mode]

Review Example
  Five houses on a hill by the beach
  House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000

Review Example: Summary Statistics
  House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000 (Sum: $3,000,000)
  Mean: $3,000,000/5 = $600,000
  Median: middle value of ranked data = $300,000
  Mode: most frequent value = $100,000

Which measure of location is the “best”?
  The mean is generally used, unless extreme values (outliers) exist
  Then the median is often used, since the median is not sensitive to extreme values.
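The review example can be checked with Python's standard `statistics` module. This is only a sketch of the calculation using the five house prices above; the variable names are illustrative, and `multimode` (Python 3.8 or later) is used because it returns every most-frequent value rather than a single one.

```python
from statistics import mean, median, multimode

# House prices from the review example.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

ranked = sorted(prices)
n = len(ranked)
median_position = (n + 1) / 2   # a position in the ranked data, not the median's value

print("mean   =", mean(prices))              # sum of values divided by number of values
print("median =", median(prices))            # value at the median position
print("mode   =", multimode(prices))         # list of the most frequent value(s)
print("median position =", median_position)
```

The single $2,000,000 house pulls the mean up to twice the median, which is why the median is usually reported when outliers are present.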
  Example: median home prices may be reported for a region, since the median is less sensitive to outliers

Geometric Mean
  Geometric mean
    Used to measure the rate of change of a variable over time
    $\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$
  Geometric mean rate of return
    Measures the status of an investment over time
    $\bar{R}_G = [(1+R_1) \times (1+R_2) \times \cdots \times (1+R_n)]^{1/n} - 1$
    where $R_i$ is the rate of return in time period i

Example
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to $100,000 at the end of year two:
  $X_1$ = $100,000
  $X_2$ = $50,000 (50% decrease)
  $X_3$ = $100,000 (100% increase)
The overall two-year return is zero, since it started and ended at the same level.

Example (continued)
Use the 1-year returns to compute the arithmetic mean and the geometric mean:
  Arithmetic mean rate of return (misleading result):
    $\bar{X} = \frac{(-50\%) + (100\%)}{2} = 25\%$
  Geometric mean rate of return (more accurate result):
    $\bar{R}_G = [(1+R_1) \times (1+R_2)]^{1/2} - 1 = [(1 + (-50\%)) \times (1 + (100\%))]^{1/2} - 1 = [(.50) \times (2)]^{1/2} - 1 = 1^{1/2} - 1 = 0\%$

Quartiles
  Quartiles split the ranked data into 4 segments with an equal number of values per segment (25% in each, separated by Q1, Q2, and Q3)
  The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
  Q2 is the same as the median (50% are smaller, 50% are larger)
  Only 25% of the observations are greater than the third quartile, Q3

Quartile Formulas
Find a quartile by determining the value in the appropriate position in the ranked data, where
  First quartile position: Q1 = (n+1)/4
  Second quartile position: Q2 = (n+1)/2 (the median position)
  Third quartile position: Q3 = 3(n+1)/4
  and n is the number of observed values

Quartiles
Example: find the first quartile
  Sample data in ordered array: 11 12 13 16 16 17 18 21 22 (n = 9)
  Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so use the value halfway between the 2nd and 3rd values: Q1 = 12.5
  Q1 and Q3 are measures of noncentral location
  Q2 = median, a measure of central tendency

Measures of Variation
  Variation: Range, Interquartile Range, Variance, Standard
  Deviation, Coefficient of Variation
  Measures of variation give information on the spread or variability of the data values.
  [Figure: two distributions with the same center but different variation]

Range
  Simplest measure of variation
  Difference between the largest and the smallest observations:
    Range = $X_{largest} - X_{smallest}$
  Example (values from 1 to 14 on a number line from 0 to 14): Range = 14 - 1 = 13

Disadvantages of the Range
  Ignores the way in which data are distributed
    [Figure: two differently distributed data sets on a number line from 7 to 12, both with Range = 12 - 7 = 5]
  Sensitive to outliers
    1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
    1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119

Interquartile Range
  Can eliminate some outlier problems by using the interquartile range
  Eliminate some high- and low-valued observations and calculate the range from the remaining values
  Interquartile range = 3rd quartile - 1st quartile = Q3 - Q1

Interquartile Range
Example:
  X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70 (25% of the data in each segment)
  Interquartile range = 57 - 30 = 27

Variance
  Average (approximately) of squared deviations of values from the mean
  Sample variance:
    $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
  where $\bar{X}$ = arithmetic mean, n = sample size, $X_i$ = ith value of the variable X

Standard Deviation
  Most commonly used measure of variation
  Shows variation about the mean
  Has the same units as the original data
  Sample standard deviation:
    $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$

Calculation Example: Sample Standard Deviation
  Sample data ($X_i$): 10 12 14 15 17 18 18 24
  n = 8, Mean = $\bar{X}$ = 16
    $S = \sqrt{\frac{(10-\bar{X})^2 + (12-\bar{X})^2 + (14-\bar{X})^2 + \cdots + (24-\bar{X})^2}{n-1}}$
    $= \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + \cdots + (24-16)^2}{8-1}}$
    $= \sqrt{\frac{130}{7}} \approx 4.3095$
  (The squared deviations are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130.)
  A measure of the “average” scatter around the mean

Measuring variation
  [Figure: a distribution with a small standard deviation and a distribution with a large standard deviation]

Comparing Standard Deviations
  [Figure: three data sets plotted on a number line from 11 to 21]
  Data A: Mean = 15.5, S = 3.338
  Data B: Mean = 15.5, S = 0.926
  Data C: Mean = 15.5, S = 4.570

Advantages of Variance and Standard Deviation
  Each value
in the data set is used in the calculation
  Values far from the mean are given extra weight (because deviations from the mean are squared)

Coefficient of Variation
  Measures relative variation
  Always in percentage (%)
  Shows variation relative to the mean
  Can be used to compare two or more sets of data measured in different units
    $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$

Comparing Coefficient of Variation
  Stock A: average price last year = $50, standard deviation = $5
    $CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
  Stock B: average price last year = $100, standard deviation = $5
    $CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
  Both stocks have the same standard deviation, but stock B is less variable relative to its price

Shape of a Distribution
  Describes how data are distributed
  Measures of shape: symmetric or skewed
    Left-skewed: Mean < Median
    Symmetric: Mean = Median
    Right-skewed: Median < Mean

Using Microsoft Excel
  Descriptive statistics can be obtained from Microsoft® Excel
  Use menu choice: tools / data analysis / descriptive statistics
  Enter details in the dialog box and check the box for summary statistics
  Click OK
  [Figure: Microsoft Excel descriptive statistics output, using the house price data: $2,000,000; 500,000; 300,000; 100,000; 100,000]

Population Summary Measures
  Population summary measures are called parameters
  The population mean is the sum of the values in the population divided by the population size, N:
    $\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$
  where μ = population mean, N = population size, $X_i$ = ith value of the variable X

Population Variance
  Average of squared deviations of values from the mean
  Population variance:
    $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
  where μ = population mean, N = population size, $X_i$ = ith value of the variable X

Population Standard Deviation
  Most commonly used measure of variation
  Shows variation about the mean
  Has the same units as the original data
  Population standard deviation:
    $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$

The Empirical Rule
If
the data distribution is bell-shaped, then the interval:
  μ ± 1σ contains about 68% of the values in the population or the sample
  μ ± 2σ contains about 95% of the values in the population or the sample
  μ ± 3σ contains about 99.7% of the values in the population or the sample

Chebyshev Rule
  Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean (for k > 1)
  Examples, at least:
    $(1 - 1/1^2)$ = 0% within k = 1 (μ ± 1σ), which gives no information; this is why the rule is stated for k > 1
    $(1 - 1/2^2)$ = 75% within k = 2 (μ ± 2σ)
    $(1 - 1/3^2)$ = 89% within k = 3 (μ ± 3σ)

Exploratory Data Analysis
  Box-and-whisker plot: a graphical display of data using the 5-number summary:
    Minimum -- Q1 -- Median -- Q3 -- Maximum
  [Figure: box-and-whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the data in each of the four segments]

Shape of Box-and-Whisker Plots
  The box and central line are centered between the endpoints if the data are symmetric around the median
  A box-and-whisker plot can be shown in either vertical or horizontal format

Distribution Shape and Box-and-Whisker Plot
  [Figure: box-and-whisker plots (Q1, Q2, Q3) for left-skewed, symmetric, and right-skewed distributions]

Box-and-Whisker Plot Example
  Below is a box-and-whisker plot for the following data:
    0 2 2 2 3 3 4 5 5 10 27
    Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
  These data are right skewed, as the plot depicts

The Sample Covariance
  The sample covariance measures the strength of the linear relationship between two variables (called bivariate data)
  The sample covariance:
    $\mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
  Only concerned with the strength of the relationship
  No causal effect is implied

Interpreting Covariance
  Covariance between two random variables:
    cov(X,Y) > 0: X and Y tend to move in the same direction
    cov(X,Y) < 0: X and Y tend to move in opposite directions
    cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not establish independence)

Coefficient of Correlation
  Measures the relative strength of the linear relationship between two
variables
  Sample coefficient of correlation:
    $r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\mathrm{cov}(X,Y)}{S_X S_Y}$
  where $S_X$ and $S_Y$ are the sample standard deviations of X and Y

Features of Correlation Coefficient, r
  Unit free
  Ranges between -1 and 1
  The closer to -1, the stronger the negative linear relationship
  The closer to 1, the stronger the positive linear relationship
  The closer to 0, the weaker the linear relationship

Scatter Plots of Data with Various Correlation Coefficients
  [Figure: six scatter plots of Y against X, with r = -1, r = -.6, r = 0, r = +.3, r = +1, and r = 0]

Using Excel to Find the Correlation Coefficient
  Select Tools/Data Analysis
  Choose Correlation from the selection menu
  Click OK
  Input the data range and select appropriate options
  Click OK to get output

Interpreting the Result
  [Figure: Scatter Plot of Test Scores; Test #2 Score against Test #1 Score, both axes from 70 to 100]
  r = .733
  There is a relatively strong positive linear relationship between test score #1 and test score #2
  Students who scored high on the first test tended to score high on the second test

Pitfalls in Numerical Descriptive Measures
  Data analysis is objective
    Should report the summary measures that best meet the assumptions about the data set
  Data interpretation is subjective
    Should be done in a fair, neutral and clear manner

Ethical Considerations
Numerical descriptive measures:
  Should document both good and bad results
  Should be presented in a fair, objective and neutral manner
  Should not use inappropriate summary measures to distort facts
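As a worked illustration of the covariance and correlation formulas, the Python sketch below applies them to the volume-per-day and cost-per-day pairs from the scatter diagram example earlier in these slides. The variable names are illustrative assumptions; the sketch computes r both directly and as cov(X,Y)/(S_X S_Y) to show that the two forms agree.

```python
from math import sqrt

# Paired observations from the scatter diagram example.
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]           # X: volume per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]    # Y: cost per day

n = len(volume)
x_bar = sum(volume) / n
y_bar = sum(cost) / n

# Sums of cross-products and squared deviations about the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(volume, cost))
sxx = sum((x - x_bar) ** 2 for x in volume)
syy = sum((y - y_bar) ** 2 for y in cost)

cov_xy = sxy / (n - 1)         # sample covariance
s_x = sqrt(sxx / (n - 1))      # sample standard deviation of X
s_y = sqrt(syy / (n - 1))      # sample standard deviation of Y

r_direct = sxy / sqrt(sxx * syy)       # first form of the formula
r_from_cov = cov_xy / (s_x * s_y)      # second form: cov(X,Y) / (S_X * S_Y)

print(f"cov(X,Y) = {cov_xy:.2f}")
print(f"r (direct) = {r_direct:.3f}, r (from covariance) = {r_from_cov:.3f}")
```

The covariance carries the units of both variables, so its size is hard to interpret on its own; dividing by the two standard deviations gives the unit-free r.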