Measurement Mythbusters: 101

Measurement Myth Busters 101

Joe Adams, Ph.D.

www.joeadams.net

Things to keep in mind:

All measurement contains error.

All measures are human creations.

All measures require an observer or instrument user.

More things to keep in mind:

Measurement is a discipline.

– What do you see?

– What do you hear?

– How do you look and listen objectively?

– How do you describe/define the observation?

Myth #1: You can’t measure

(fill in the blank).

The Best Measures are Simple

Measures are a shorthand for experience or observations.

Knowing your subject matter counts!

If they can do it, you can too!

Don’t be fooled by naysayers.

Gilley’s song inspired four teams of researchers to test his hypothesis!

And he was almost right!

The Beat Goes On…

And so did the research:

– On Attractiveness

– On Mate Selection

– On Stability of Relationships

– On Genetic Cues, etc., etc…

– On a lot of things you really don’t want to know…

Myth #2: It’s all subjective!

“Beauty is in the eye of the beholder!”

The Distorted Cultural Legacy of A.J. Ayer (1910 – 1989)

Language, Truth, and Logic (1936)

The most famous spokesman for the fact/value dichotomy.

Claimed that all statements about values are merely expressions of emotion, with no logical significance.

Also a formidable opponent to Mike Tyson.

Ayer v. Tyson

“[Ayer] taught or lectured several times in the United States, including serving as a visiting professor at Bard College in the fall of 1987. At a party that same year held by fashion designer Fernando Sanchez, Ayer, then 77, confronted Mike Tyson harassing the (then little-known) model Naomi Campbell. When Ayer demanded that Tyson stop, the boxer said: "Do you know who the f*** I am? I'm the heavyweight champion of the world," to which Ayer replied: "And I am the former Wykeham Professor of Logic. We are both pre-eminent in our field. I suggest that we talk about this like rational men". Ayer and Tyson then began to talk, while Naomi Campbell slipped out.” - Wikipedia

TKO – First Round!

Verifiable on Wikipedia

“The fact of twilight does not prevent us from distinguishing between day and night.”

Attributed to Dr. Samuel Johnson (1709-1784)

The Real Issues Are:

Validity and

Reliability

Validity – Relevance - Logic

DESIRABLE QUALITIES:

– RELEVANCE: Measures should mean something important to those who use them – performance measures should drive performance!

– PURITY: Measures should deal with a clearly defined domain or dimension of a particular quality.

– REPRESENTATIVENESS: Measures should capture something about a phenomenon without distorting the phenomenon.

Invalid Measures

Tend to obscure reality, not illuminate it.

May lead to erroneous, spurious, or absurd conclusions.

In Application of Measures

Internal Threats to Validity

Selection – picking facts that fit hypothesis

History – observations taken at different times

Maturation Effect – subjects or effects mature

Repeated Testing – subjects get test-wise

Instrumentation – “breaks down” or used incorrectly

Experimental Mortality – people drop out

Experimenter Bias – creates expectations

Threats to External Validity

Generalizability of results may be limited by:

– TIME – Sample taken on Fat Tuesday!

– SETTING – During the Superbowl.

– PLACES – As they come out of Sugars…

– PEOPLE (SAMPLE) – Inside Sugars…

– OBSERVER – Barney Fife

Threats to External Validity

(Continued)

Generalizability of results may be limited by:

– Placebo Effect – MSU Health Plan

– Novelty Effect – Ooo wow!

– Hawthorne Effect – More below.

Summary of Validity Issues

Does the measure capture what you intend it to capture?

Artifacts of measurement

Artifacts of Measuring

Measures that pretend to be one thing, but are actually something else (e.g. pleasing answers).

An artifact might mean that the act of measuring caused something to register that wasn’t there

(e.g. questions about non-existent opinions).

The act of measurement disturbs the same reality it is measuring, a problem often loosely called the Heisenberg Principle and more precisely the observer effect (interviewers may make people self-conscious).

The Hawthorne Effect

Western Electric plant at Hawthorne Works, outside Chicago in Cicero, Illinois

A series of studies done by Harvard professors between 1924 and 1932.

They were testing hypotheses about working conditions and productivity.

Treatment groups increased productivity regardless of conditions…

Why did they improve?

They felt “special” for being chosen to participate in the experiment.

The experiments spawned the whole Human Relations school of thought in the field of management.

The Rosenthal Effect

Studies done by Robert Rosenthal and Lenore Jacobson (1968/1992).

Also called the Pygmalion Effect.

Observer / Teacher expectations improved student results… more than different “treatments.”

That’s the good news about teaching: It matters.

Reliability - Consistency

DESIRABLE QUALITIES:

– ROBUSTNESS: Measures should work well under a variety of extraneous conditions.

– PRECISENESS: Measures should differentiate between different qualities or gradations.

– SENSITIVITY: Measures should detect change.

Intercoder Reliability

Inter-coder or inter-rater reliability : The results of two or more people correlate with each other on a particular item, using the same scale or instrument.

Problem: They see the same thing looking through the same lenses (but they were drunk).

– In the example from the Girls All Get Prettier at Closing Time, inter-coder reliability on the attractiveness of females typically reaches .90, or 90 percent, depending on how you define reliability.

Most research in this area indicates a high degree of consistency from both sexes. Does drinking help?
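Why does the number depend on how you define reliability? A minimal sketch, assuming made-up ratings on a 1-to-10 scale (the raters, scores, and function names are illustrative, not data from the studies above), computes both a correlation and a plain percent agreement for the same pair of raters:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def percent_agreement(x, y):
    """Share of items on which both raters gave the identical score."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical ratings of the same eight subjects by two raters.
rater_1 = [7, 5, 9, 4, 6, 8, 3, 7]
rater_2 = [8, 5, 9, 5, 6, 9, 4, 7]

r = pearson(rater_1, rater_2)
agree = percent_agreement(rater_1, rater_2)
print(f"correlation = {r:.2f}, exact agreement = {agree:.0%}")
```

With these invented scores the two raters correlate strongly yet give the identical score on only half of the subjects, so the same pair can look very reliable or only moderately reliable depending on the definition chosen.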

Internal Consistency

Internal consistency : The results of one measure correlate with other similar, but different, measures measuring the same thing.

Problem: Error in the measures may be correlated more than the content. It’s the correlation between the measures that is the key to knowing whether the measures are reliable, but that might be a problem:

The observer was drunk again. (GIGO)
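One common summary of internal consistency is Cronbach's alpha. The slides do not name a statistic, so treat this as one possible choice rather than the method intended here. A minimal sketch with invented scores:

```python
from statistics import variance

def cronbach_alpha(items):
    """items: list of k lists, one per measure, each holding one score per subject."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # per-subject total score
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Hypothetical scores: three measures of the same thing, five subjects.
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]

alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.2f}")
```

A high alpha only says the measures move together. It cannot say whether they move together because of shared content or because of shared error, which is exactly the problem noted above.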

Test-retest Reliability

Test-retest reliability : Try measuring the same thing with the same instrument more than once to see if the results are the same.

Problem: The Barney Fife problem – the person using the instrument is part of the instrument (retest won’t catch this).

– Examples: Racial differences between interviewer and subject may shift responses on surveys dealing with race. Male versus female interviewers asking about sexual issues has the same problem.
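A small sketch of why a retest will not catch the Barney Fife problem: if the observer reads every value too high by the same amount on both occasions, test and retest still agree perfectly. The true values and the size of the bias below are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

true_values = [10, 12, 15, 11, 18, 14]
observer_bias = 3  # hypothetical: this observer always reads 3 units high

test = [v + observer_bias for v in true_values]
retest = [v + observer_bias for v in true_values]

r = pearson(test, retest)
mean_error = sum(t - v for t, v in zip(test, true_values)) / len(true_values)
print(f"test-retest correlation = {r:.2f}, average error = {mean_error}")
```

The correlation comes out at essentially 1.0 even though every reading is off by 3. Consistency is not the same thing as accuracy.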

Split-Half Reliability

Split-half reliability : Split the items of a scale into two equivalent halves to see if they correlate.

Example: Use two different questions in the same survey to measure the same thing. If they are correlated, you’ve demonstrated the reliability of the instrument(s).

Half Goofy: The MMPI

The Minnesota Multiphasic Personality Inventory (1952 - )

– It’s the pattern, not the questions alone.

– Different axes (dimensions).

– The Diagnostic and Statistical Manual of Mental Disorders (DSM)

Provides standardized diagnoses.

Describes some treatment protocols

Resources for Testing Validity and Reliability

G. David Garson, Quantitative Research in Public Administration http://www2.chass.ncsu.edu/garson/pA765/reliab.htm

Wikipedia, Validity (Statistics) http://en.wikipedia.org/wiki/Validity_%28statistics%29

Wikipedia, Validity (Logic) http://en.wikipedia.org/wiki/Validity

Myth #3: Madison Avenue is home to the world’s greatest scientific minds (“Data proves (fill in the blank)”).

How often have you heard:

“Scientific research proves….”

Science does not prove, it disproves.

Key things to understand:

– In science, a null hypothesis is either rejected or not rejected; it is never proven.

– The outcome of any experiment or statistical comparison counts as only one observation, regardless of the number of data points.

– Different observations at different times may yield different results.

– Eternity is not ours to observe.

Key References

David Hume (1711 – 1776)– Noted that there is nothing logically necessary about the repetition of a pattern continuing in the future.

Ludwig Wittgenstein (1889 – 1951) – Wrote the Tractatus Logico-Philosophicus, which outlines almost all of the rules of scientific endeavor. One of its most important points is that the notion of causation is a purely intellectual construction and is never a fact.

Myth #4: The whole is equal to the sum of the parts.

AKA: The Ecological Fallacy

The Level of Measurement Matters

(A Logical Validity Issue)

Levels of Analysis: Examples

Individual – a person, single cell, atom, e.g. smallest discrete unit.

Group – may meet face-to-face

Organization – does not generally meet face-to-face

State – a geopolitical jurisdiction

Nation – Like Texas y’all.

Aggregate measures cannot generally be used to estimate disaggregated behavior.

Conclusions about individual-level behavior cannot be drawn from aggregate comparisons.

Example: Emile Durkheim’s Study of Suicide.

Just because more Bavarians commit suicide, it does not follow that Catholics are more likely to commit suicide.
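The gap between levels can be shown with a toy data set. In this sketch (all numbers invented, with generic x and y standing in for any two measures), the regional averages line up perfectly in one direction while the individuals inside every region line up perfectly in the other:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean(values):
    return sum(values) / len(values)

# Hypothetical individuals as (x, y) pairs, grouped by region.
regions = {
    "A": [(1, 3), (2, 2), (3, 1)],
    "B": [(4, 6), (5, 5), (6, 4)],
    "C": [(7, 9), (8, 8), (9, 7)],
}

# Aggregate level: one (mean x, mean y) point per region.
region_x = [mean([x for x, _ in people]) for people in regions.values()]
region_y = [mean([y for _, y in people]) for people in regions.values()]
aggregate_r = pearson(region_x, region_y)

# Individual level: the relationship inside each region.
within_r = {
    name: pearson([x for x, _ in people], [y for _, y in people])
    for name, people in regions.items()
}

print(f"between regions: {aggregate_r:+.2f}")
for name, r in within_r.items():
    print(f"inside region {name}: {r:+.2f}")
```

Anyone who saw only the three regional averages would conclude that x and y rise together. Inside every region the opposite holds.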

Disaggregated data cannot generally be used to estimate aggregate behavior.

Conclusions about aggregate behavior cannot be drawn from individual level data.

Example: Hydrogen and Oxygen burn. H2O does not.

Not ALL Texans carry guns and wear cowboy hats.

Not ALL Austinites wear speedos and ride 10-speeds downtown.

Maybe Not?

Gary King (1997). A Solution to the Ecological Inference Problem, Princeton University Press.

Within limits, there may be “probable” statements about inferences between levels. The level of certainty about such statements can be estimated.

http://gking.harvard.edu/stats.shtml

Myth #5: Attitudes indicate behavior.

Classic Case:

Attitudes ≠ Behavior

– LaPiere, Richard T. “Attitudes vs. Actions,” Social Forces, Vol. 13, No. 2. (Dec., 1934), pp. 230-237.

Actual Behavior

Customer Satisfaction?

Case #2 (1983)

Cenaré Italian Cuisine

404 East University Drive

College Station, Texas

The tale of the half-price special!

Dr. Robert A. Peterson

Associate Dean for Research at the University of Texas’ McCombs School of Business

Robert A. Peterson and William R. Wilson (1992). Measuring Customer Satisfaction: Fact and Artifact, Journal of the Academy of Marketing Science, Vol. 20, No. 1, 61-71.

Customer satisfaction surveys may be measuring how many happy people or unhappy people are in the sample, nothing more.

Myth #6: Quantitative data are different than qualitative data.

Developing Measures

“Quantification is merely a second order matching of primary qualities.”

Karl Wolfgang Deutsch (1912-1992)

Develop Powerful Measures!

Three levels of measurement:

–Nominal – The weakest measure

–Ordinal – Mediocre, but not awful.

–Interval/Ratio – The best possible.

Nominal Measures

Nominal (Categorical) – refers to opaque qualities, color, sex, nationality, groups, etc. Must have no order or rank.

Problem: There might be a hidden order to the measure that is not immediately identifiable, particularly in cases where social status may correlate with other measures (income, education, etc.). The existence of some hidden order is an empirical question that can be tested.

Ordinal Measures

Ordinal Measures – have direction or dimension, with greater and lesser ends to the measure.

Likert or Guttman Scales, 7-point, 5-point, but no specific distance between points. Example: Scalding, hot, warm, cool, cold, freezing, etc…

Problem: Survey question construction may prompt an order (preference among candidates).

Randomization is a partial remedy.

Interval/Ratio

Interval / Ratio Measures – Most precise kind of measures. They have a constant interval of some kind and admit of degree and gradations; sometimes referred to as a “common metric.”

Problem: Intervals may not be constant (linear). The measures may hide uneven increments. An example is education in years. A year of college is not equal to a year of elementary school (unless you went to t.u.)

Develop Powerful Measures!

The more precise the measure, the more powerful the analytical techniques that can be used

– Nominal: Crosstabs, Chi-square, etc.

– Ordinal: Tau-b, rank order correlation, etc.

– Interval/ratio: Regression, time-series, etc.
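As an illustration of the weakest rung, here is a minimal sketch of the chi-square statistic for a crosstab of two nominal variables. The counts are invented and the function name is only for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a crosstab given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical crosstab (rows: two groups, columns: two answers).
crosstab = [
    [30, 10],
    [20, 40],
]

stat = chi_square(crosstab)
dof = (len(crosstab) - 1) * (len(crosstab[0]) - 1)
print(f"chi-square = {stat:.2f} with {dof} degree(s) of freedom")
```

All a nominal measure supports is counting who falls in which cell. The statistic says whether the cells differ from what independence would predict, not by how much one category "exceeds" another.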

Definitions Precision

The precision of the measure depends on two critical items:

– The quality of the definition, and

– The quality of the data collection system.

Parts of a Good Definition

A clear description of the purpose

A clear description of what the measure is supposed to measure

A clear description of how the measure is to be applied, which includes:

– Every step in the data collection process

– A means for identifying error in the collection process (what the measure is not)

An explanation of how the measure will be used.

“There are no facts, only interpretations.”

Friedrich Nietzsche (1844-1900)

Context Matters

What is the theory, hypothesis, or logic model that makes this measure sensible?

Is the measurement tied to a particular problem?

Is the problem an intellectual/academic question or a practical problem requiring a solution?

What question is the measure supposed to answer?

Some call them Paradigms

Concept popularized by Thomas Kuhn in the

Structure of Scientific Revolutions (1962).

The paradigm includes all the methods related to the practice of a scientific endeavor, including the instrumentation and operating assumptions.

Example: “Tell me about your mother…” http://en.wikipedia.org/wiki/Thomas_Samuel_Kuhn

What is your context?

Why do you need to measure something?

– To test a hypothesis?

– To make decisions about agency operations?

– To calculate cost/benefits?

– To demonstrate effectiveness?

– To understand what is happening?

To find someone to blame?

Theories that Work!

On Good Theories: On the characteristics of a good theory, see the work of Imre Lakatos, especially his book The Methodology of Scientific Research Programmes: Philosophical Papers Volume 1 (1977), and Harry G. Frankfurt's On Truth. (See also On Bullshit.)

Good theories exemplify the characteristics of parsimony (simplicity, elegance), explanatory power (apply in a wide variety of situations), robustness (they operate in contaminated environments), and empirical support (fit facts).

Feeling Good… was good enough for me and Bobby McGee…

Kris Kristofferson

(b. 1936, Brownsville, Texas)

Flow: The Psychology of Optimal Experience by Mihaly Csikszentmihalyi

[Figure: Challenges plotted against Skills, with regions labeled Anxiety, Flow, and Boredom]

The Good Work Project

Recommended Reading: Martin E.P. Seligman, Authentic Happiness.com (Book Website). See his What You Can Change and What You Can't and The Optimistic Child; also see The Science of Optimism and Hope: Research Essays in Honor of Martin E. P. Seligman, and Mihaly Csikszentmihalyi's Flow: The Psychology of Optimal Experience.

Also, see The Good Work Project website for applications of these theories.

Myth #7: Measures have to be exact.

“…it is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits...”

- Aristotle

Nicomachean Ethics

Special Cases for Estimation

Measures that estimate ranges and compare proportions across two or more dimensions.

Measures that show relationships, trade-offs, and thresholds.

Measures that show what is not seen, residuals.

Flight Envelope Summarizes

Flight envelopes are estimated from available data which show the following characteristics:

a Take-off speed

b Stalling speed

c Ceiling, with corresponding speed

d Maximum level speed

e Maximum speed at altitude

f Maximum sea level speed

Two-dimensions: Flight Envelope

1. Altitude (expressed in ranges)

2. Speed (expressed in ranges)

Comparing Flight Envelopes

1. Combat helicopter (ex. Boeing AH-64 Apache)

2. Cargo aircraft (ex. Lockheed C-130J)

3. Subsonic transport aircraft (ex. Airbus A-300)

4. Supersonic fighter aircraft (ex. Lockheed F-16C)

http://www.aerodyn.org/Atm-flight/flimit.html

Measuring Inequality

The Lorenz Curve describes any distribution of a quantity across any population.

– The Gini coefficient provides a global estimation of the degree of inequality within that population.
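A minimal sketch of both ideas, assuming a small list of invented holdings: the Lorenz curve is built from sorted cumulative shares, and the Gini coefficient is computed as one minus twice the area under that curve.

```python
def lorenz_points(values):
    """Cumulative population share vs. cumulative share of the quantity."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points

def gini(values):
    """Gini coefficient as 1 minus twice the area under the Lorenz curve."""
    points = lorenz_points(values)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule
    return 1 - 2 * area

# Hypothetical populations of four people each.
equal = gini([10, 10, 10, 10])   # everyone holds the same amount
unequal = gini([0, 0, 0, 40])    # one person holds everything

print(f"equal shares: {equal:.2f}, one holder: {unequal:.2f}")
```

Equal shares give 0. With one holder among four people the coefficient is 0.75, and it approaches 1 as the population grows.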

The Gini Coefficient

Trade-offs

Bounded by a zero point (no trade-off).

Change in A = Change in B = 0

Trade-offs between A & B may occur six ways:

– A increases, B decreases

– B increases, A decreases

– A increases more than B

– B increases more than A

– A decreases more than B

– B decreases more than A
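The six cases can be written down as a small classifier. This is only a sketch: the function name is invented, and changes that fall outside the six listed cases (for example, equal changes, or one side not moving) are reported as such rather than forced into a category.

```python
def classify_tradeoff(delta_a, delta_b):
    """Name the trade-off case for a change in A and a change in B."""
    if delta_a == 0 and delta_b == 0:
        return "no trade-off"
    if delta_a > 0 and delta_b < 0:
        return "A increases, B decreases"
    if delta_b > 0 and delta_a < 0:
        return "B increases, A decreases"
    if delta_a > 0 and delta_b > 0:
        if delta_a > delta_b:
            return "A increases more than B"
        if delta_b > delta_a:
            return "B increases more than A"
    if delta_a < 0 and delta_b < 0:
        if delta_a < delta_b:
            return "A decreases more than B"
        if delta_b < delta_a:
            return "B decreases more than A"
    return "outside the six listed cases"

# Hypothetical budget changes for two programs.
print(classify_tradeoff(3, -1))
print(classify_tradeoff(-1, -3))
```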

Four Trade-off Conditions

Potential Trade-offs | A Wins | B Wins
Net Increase         | A > B  | A < B
Net Decrease         | A > B  | A < B

Four Basic Conditions

More on this later…

A Real-Life Measurement Problem

The Mississippi Department of Wildlife, Fisheries, and Parks has an 8-week backlog in boat registration and sportsman’s licenses.

Delays do not discriminate between individuals, whether they be:

– Farmers

– Bankers

– Legislators, or

– Governors.

Myth #8: You have to observe subjects directly.

Measuring the Unseen

The Sherlock Holmes Approach

“We must fall back upon the old axiom that when all other contingencies fail, whatever remains, however improbable, must be the truth.”

– Sherlock Holmes

The Adventure of the Bruce-Partington Plans

(Sir Arthur Conan Doyle)

We’re on the Case!

Whatever is left…

Using residuals to measure something indirectly has been a very useful technique in several arenas.

The Most Famous Example

The Double-Helix of DNA was not observed directly. In essence, Crick and Watson used Rosalind Franklin’s x-rays of wet and dry strands of DNA.

Essentially, they were looking at the shadow of DNA, not the DNA itself.

Example 2: Relative Political Capacity

Initial observations:

All political systems must have resources.

Those that are able to obtain resources are stronger than those that cannot.

Wealthier populations are able to pay more taxes than poorer populations.

Some economies are easier to tax than others.

People don’t like to pay taxes, unless they know they’ll get the money back (e.g. Social Security).

Predicted/Model vs. Actual

Observations that fall on the regression line were given a score of 1.00.

Those above were scored as a ratio of their predicted, if double, then 2.00, three times, 3.00 and so on…

Those below their predicted tax rate were given scores from 0 to .99, based on the percentage of the predicted scores.
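The scoring rule amounts to actual divided by predicted. A minimal sketch follows, assuming a single invented predictor (wealth) and a simple least-squares line; the real model presumably uses more predictors, and every number here is made up.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Hypothetical countries: wealth and the taxes actually collected.
wealth = [10, 20, 30, 40]
actual_tax = [2, 6, 4, 8]

slope, intercept = fit_line(wealth, actual_tax)
predicted_tax = [slope * w + intercept for w in wealth]

# Score = actual / predicted: 1.00 on the line, above 1 over-performs, below 1 under-performs.
scores = [a / p for a, p in zip(actual_tax, predicted_tax)]

for w, s in zip(wealth, scores):
    print(f"wealth {w}: score {s:.2f}")
```

The score is built entirely from the residual, the part of tax collection that wealth does not explain. That leftover is what is being read as political capacity.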

Results: Uses of RPCs

Explains demographic transitions (population explosions or lack thereof).

Outcomes of wars between relatively equal/unequal opponents.

Black market exchange rates for currencies in unstable countries.

Let’s Talk Performance!

Real Men and Women Use Performance Measures!

(Weenies Don’t)

Performance measures should drive performance.

– There should be thresholds at which management takes action to do something different.

Example: Watch the altimeter for sudden drops, pull up on the yoke if the numbers go down.

– Those actions should be defined in some sort of plan

Example: At 500 feet, eject.
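A threshold plus a planned action can be as simple as a lookup table. In this sketch only the 500-foot rule comes from the example above; the 2,000-foot threshold and all names are assumptions for illustration.

```python
# Hypothetical action plan: (threshold in feet, action to take at or below it).
ACTION_PLAN = [
    (500, "eject"),
    (2000, "pull up on the yoke"),  # assumed threshold, not from the slides
]

def required_action(altitude_ft):
    """Return the action for the lowest threshold the reading has reached."""
    for threshold, action in sorted(ACTION_PLAN):
        if altitude_ft <= threshold:
            return action
    return "no action"

for reading in (5000, 1500, 400):
    print(reading, "->", required_action(reading))
```

The design point is that the action is decided before the reading arrives. A measure with no threshold and no planned response is just a number on a dial.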

A Barometer is not a Performance Measure!

Benchmarking

Choosing the Right Comparisons

Myth #9: Collin County Community College is the perfect peer.

Peer-to-Peer

Choose statistical neighbors (like you).

– Comparisons need to make sense.

Choose those with a similar environment.

– Environments need to be “controlled” analytically.

Choose those who differ on performance.

– Variation requires explanation and understanding.

– Lack of variation means nothing can be learned.

Best of Breed

Choose those who out-perform the competition.

– That is the “benchmark” to beat.

Include those who do not perform well.

– This avoids the mistake of Tom Peters.

Compare environments, but choose on performance.

Establish Baseline: Compare Trends

Track your own performance over time.

– Identify key internal and external factors.

– Test explanations (hypotheses)

Identify variations.

If there are no variations, you cannot draw any conclusions about causes. A constant explains nothing.
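The arithmetic behind "a constant explains nothing" is that a correlation divides by each series' spread, and a constant has none. A minimal sketch with invented numbers:

```python
from math import sqrt

def pearson_or_none(x, y):
    """Pearson correlation, or None when either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant cannot co-vary with anything
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical five periods of data.
outcome = [3, 5, 4, 6, 8]
constant_measure = [7, 7, 7, 7, 7]
varying_measure = [1, 2, 2, 3, 4]

print("constant measure:", pearson_or_none(constant_measure, outcome))
print("varying measure:", round(pearson_or_none(varying_measure, outcome), 2))
```

The constant measure returns no answer at all, not a weak one. There is simply nothing to relate to the outcome.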

Myth #10: Good measures don’t vary.

Measures are Variables

And Variables Vary

No Variance?

– No chance of improvement

– No Gains

– No Learning

A Costly Example of No Variance

Parties, Ideologies, and Budgets: A Study of Budget Trade-offs in 18 OECD Countries

– Based on data from 1960 to 1990

– 65,000 cells of data drawn from more than 50 sources, taking six months to enter by hand.

Results for Health vs. Defense

Results for Education vs. Defense

All is not lost

Mona Lisa

“Discovery”

Myth #11: Performance measures will improve performance.

Do Performance Measures Improve Performance?

The Case of Texas State Agencies

Myth #12: Data integrity is exclusively a reporting issue.

Reporting is an operational issue.

CREATING INTEGRITY BY DESIGN

Alabama SMART Budgeting Training

Qualities of Good Performance Measures in the Real World

RELIABILITY

Consistency – Data can be replicated by a competent, trained professional (e.g. Auditor).

Accuracy – The indicators are true to the facts.

VALIDITY

Relevance – Measures relate to progress toward realistic agency/organizational goals.

Usefulness – They provide actionable indicators

Data Integrity Starts with People

Checklist:

Are reporting roles clearly defined?

Is there documentation?

– A ‘paper trail’ for auditing?

– Written procedures for verifying data accuracy?

– Clear responsibility for reviewing and approving performance measure reports?

Is there management ownership for performance measurement reports?

If the answer to any of the first five questions is “No”:

Go back to the beginning.

Check every step from start to finish until the error or problem is identified.

If everything checks out, then it is time to look at program operations for answers.*

* This is a job that is the exclusive responsibility of program management.

Question 6: Identify Root Causes

Is the change in performance the result of an internal or external factor?

– Can the relationship between internal or external factors and performance be demonstrated with data?

Do they correlate?

What are the patterns, trends, etc.?

– What factors can be changed by management?

Can staff, training, technology or funding change the result?

What do data indicate about these connections?

Response to 6: Action Plan

What is required to change results?

– What new activities will be required to make those changes?

– What resources (or authority) would be required to implement those new activities?

– Who will implement new activities?

– When can the new activities begin?

– How long will it take for the new activities to have an effect?

Measurement Disasters

Tennessee Sour Mash:

Corn and Student Test Scores

The Situation

A University of Tennessee Ag Economics Professor proposes using crop yield formulas for measuring the “value-added” increases in student test scores.

The Tennessee General Assembly promptly enacts the idea, granting the professor a contract as the sole-source provider, naming him personally in statute (name later removed in the Tennessee Code).

Question:

How do student test scores differ from corn?

Student Test Scores Crop Yields

Photo Credit: Lloyd Wolf/U.S. Census Bureau www.freephoto.com

What Type of Measures Are They?

Nominal?

Ordinal?

Interval?

“INTERVAL”

(BOTH MEASURES)

Corn can always grow taller!

Which School Would Do Better?

Which is more likely to show a gain?

[Figure: two schools, each with a ceiling of 100%. Starting points: 95% and 50%.]

School A? School B?
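The arithmetic is short. A sketch using the two starting points and the 100% ceiling shown above (the function name is only for illustration):

```python
CEILING = 100  # percent; a test score cannot go higher than this

def max_possible_gain(starting_point, ceiling=CEILING):
    """Largest gain a school can ever show, given where it starts."""
    return ceiling - starting_point

starting_points = [95, 50]
headroom = {start: max_possible_gain(start) for start in starting_points}

for start, gain in headroom.items():
    print(f"starting at {start}%: at most {gain} points of gain")
```

The school starting at 95% can show at most 5 points of "value added", however well it teaches. Corn has no such ceiling.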

“Not everything that counts can be measured, and not everything that can be measured counts”

Albert Einstein (1879-1955)

Before we accept the first premise, we have to ask,

“Have we tried?”

Myth #13: Measures can’t detect management issues.

Measuring What is Important

Organizational Culture

– Turnover – Big Clue!

– Absenteeism – Big Clue #2!

– Lack of initiative, passivity – Clue #3

– Low morale – Starting to see a pattern?

– Anger, frustration, discipline problems…

– Sense of hopelessness!!!!

How do we measure this?

Possible Index?

TEN RULES FOR STIFLING INNOVATION

1. Regard any new idea from below with suspicion—because it’s new, and because it’s from below.

2. Insist that people who need your approval to act first go through several other levels of management to get their signatures.

3. Ask departments or individuals to challenge and criticize each other’s proposals. (That saves you the job of deciding; you just pick the survivor.)

4. Discuss your criticisms freely, and withhold your praise. That keeps people on their toes. Let them know they can be fired at any time.

5. Treat identification of problems as signs of failure, to discourage people from letting you know when something in their area isn’t working.

Cont…

TEN RULES FOR STIFLING INNOVATION (continued):

6. Control everything carefully. Make sure people count anything that can be counted, frequently.

7. Make decisions to reorganize or change policies in secret, and spring them on people unexpectedly. (That also keeps people on their toes.) Let them know that they can be fired at any time.

8. Make sure that requests for information are fully justified, and make sure that it is not given out to managers freely. (You don’t want data to fall into the wrong hands.)

9. Assign to lower-level managers, in the name of delegation and participation, responsibilities for figuring out how to cut back, lay off, move people around, or otherwise implement threatening decisions you have made, and get them to do it quickly.

10. And above all, never forget that you, the higher-ups, already know everything important about this business.

“These ‘rules’ reflect pure segmentalism in action—a culture and an attitude that make it unattractive and difficult for people in the organization to take initiative to solve problems and develop innovative solutions… Segmentalist companies may not suffer from a lack of potential innovators so much as from failure to make the power available to those embryonic entrepreneurs that they can use to innovate.

And, …when innovations do occur, segmentalist organizations may not even be able to take advantage of them.”

Rosabeth Moss Kanter, The Change Masters, 1982, p. 101.

Myth #14: Counting people is easy.

How many people did you serve?

Recidivism or Repeat Customers?

Do unduplicated counts make more sense than duplicated counts? Why?

How do we count level of service?

What if wrap-around services are effective and one-shot taps on the head are not?

What counts as service?

How do we count costs for repeat customers or those that consume more than one menu item?
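The difference between the two counts is easy to see in a service log. A minimal sketch, assuming an invented log with one entry per visit keyed by a client ID:

```python
from collections import Counter

# Hypothetical service log: one entry per visit, identified by client ID.
visits = ["c01", "c02", "c01", "c03", "c01", "c02"]

duplicated_count = len(visits)          # every visit counts
unduplicated_count = len(set(visits))   # every person counts once

visits_per_client = Counter(visits)
repeat_clients = sorted(c for c, n in visits_per_client.items() if n > 1)

print(f"visits served: {duplicated_count}")
print(f"people served: {unduplicated_count}")
print(f"repeat clients: {repeat_clients}")
```

Six visits or three people: both numbers are true, and they answer different questions. Which one belongs in the report depends on what "served" is supposed to mean.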

“Life is not divided into federal block grants.”

Robert Greenstein

Center for Budget and Policy Priorities

NCSL Conference in Burlington, VT

September 1995

Myth #15: We’ve already counted everything that’s important.

People are strange.

Your measures need to capture reality!

All relevant observations must fit somewhere on the measure.

If they don’t, you’re missing reality.

Anomalies are as important as the “normal” observations.

We learn from measures when they help us see something we would have missed.

Outcome Measures:

Telling the Tale that Wags the Dog?

Is anybody better off?

Is anybody worse off?

How can you tell?

Adapted from Mark Friedman’s Trying Hard is Not Good Enough.
