Joe Adams, Ph.D.
www.joeadams.net
All measurement contains error.
All measures are human creations.
All measures require an observer or instrument user.
Measurement is a discipline.
– What do you see?
– What do you hear?
– How do you look and listen objectively?
– How do you describe/define the observation?
The Best Measures are Simple
Measures are a shorthand for experience or observations.
Knowing your subject matter counts!
If they can do it, you can too!
Don’t be fooled by naysayers.
Gilley’s song inspired four teams of researchers to test his hypothesis!
And he was almost right!
And so did the research:
– On Attractiveness
– On Mate Selection
– On Stability of Relationships
– On Genetic Cues, etc., etc…
– On a lot of things you really don’t want to know…
“Beauty is in the eye of the beholder!”
The Distorted Cultural Legacy of A.J. Ayer (1910 – 1989)
Language, Truth, and Logic (1936)
The most famous spokesman for the fact/value dichotomy.
Claimed that all statements about values are merely expressions of emotion, with no logical significance.
Also a formidable opponent to Mike Tyson.
Ayer v. Tyson
“[Ayer] taught or lectured several times in the United States, including serving as a visiting professor at Bard College in the fall of 1987. At a party that same year held by fashion designer Fernando Sanchez, Ayer, then 77, confronted Mike Tyson harassing the (then little-known) model Naomi Campbell. When Ayer demanded that Tyson stop, the boxer said: "Do you know who the f*** I am? I'm the heavyweight champion of the world," to which Ayer replied: "And I am the former Wykeham Professor of Logic. We are both pre-eminent in our field. I suggest that we talk about this like rational men". Ayer and Tyson then began to talk, while Naomi Campbell slipped out.”
– Wikipedia
TKO – First Round!
Verifiable on Wikipedia
“The fact of twilight does not prevent us from distinguishing between day and night.”
Attributed to
Dr. Samuel Johnson (1709-1784)
The Real Issues Are:
Validity and
Reliability
DESIRABLE QUALITIES:
– RELEVANCE: Measures should mean something important to those who use them – performance measures should drive performance!
– PURITY: Measures should deal with a clearly defined domain or dimension of a particular quality.
– REPRESENTATIVENESS: Measures should capture something about a phenomenon without distorting the phenomenon.
Tend to obscure reality, not illuminate it.
May lead to erroneous, spurious, or absurd conclusions.
In Application of Measures
Internal Threats to Validity
Selection – picking facts that fit hypothesis
History – observations taken at different times
Maturation Effect – subjects or effects mature
Repeated Testing – subjects get test-wise
Instrumentation – “breaks down” or used incorrectly
Experimental Mortality – people drop out
Experimenter Bias – creates expectations
Threats to External Validity
Generalizability of results may be limited by:
– TIME – Sample taken on Fat Tuesday!
– SETTING – During the Super Bowl.
– PLACES – As they come out of Sugars…
– PEOPLE (SAMPLE) – Inside Sugars…
– OBSERVER – Barney Fife
Threats to External Validity
(Continued)
Generalizability of results may be limited by:
– Placebo Effect – MSU Health Plan
– Novelty Effect – Ooo wow!
– Hawthorne Effect – More below.
Summary of Validity Issues
Does the measure capture what you intend it to capture?
Artifacts of measurement
Measures that pretend to be one thing, but are actually something else (e.g. pleasing answers).
An artifact might mean that the act of measuring caused something to register that wasn’t there (e.g. questions about non-existent opinions).
The act of measurement disturbs the same reality it is measuring, a problem popularly called the Heisenberg Principle and more precisely the observer effect (interviewers may make people self-conscious).
The Hawthorne Effect
Western Electric’s Hawthorne Works, in Cicero, Illinois, outside Chicago
A series of studies done by Harvard professors between 1924 and 1932.
They were testing hypotheses about working conditions and productivity.
Treatment groups increased productivity regardless of conditions…
Why did they improve?
They felt “special” for being chosen to participate in the experiment.
The experiments spawned the whole Human Relations school of thought in the field of management.
The Rosenthal Effect
Studies done by Robert Rosenthal and Lenore
Jacobson (1968/1992).
Also called the Pygmalion Effect.
Observer / Teacher expectations improved student results… more than different “treatments.”
That’s the good news about teaching: It matters.
DESIRABLE QUALITIES:
– ROBUSTNESS: Measures should work well under a variety of extraneous conditions.
– PRECISENESS: Measures should differentiate between different qualities or gradations.
– SENSITIVITY: Measures should detect change.
Intercoder Reliability
Inter-coder or inter-rater reliability: The results of two or more people correlate with each other on a particular item, using the same scale or instrument.
Problem: They see the same thing looking through the same lenses (but they were drunk).
– In the example from the Girls All Get Prettier at Closing Time, inter-coder reliability on the attractiveness of females typically reaches .90, or 90 percent, depending on how you define reliability.
Most research in this area indicates a high degree of consistency from both sexes. Does drinking help?
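One common way to put a number on inter-rater agreement is a Pearson correlation between two raters' scores. The sketch below is an illustration only: the 1-10 ratings are hypothetical and `pearson_r` is a hand-rolled helper, not the method used in the studies mentioned above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of the same six subjects by two raters.
rater_a = [7, 5, 9, 4, 6, 8]
rater_b = [8, 5, 9, 3, 6, 7]
sober_r = pearson_r(rater_a, rater_b)

# The "same lenses" problem: inflate BOTH raters by the same amount.
tipsy_a = [score + 2 for score in rater_a]
tipsy_b = [score + 2 for score in rater_b]
tipsy_r = pearson_r(tipsy_a, tipsy_b)

print(round(sober_r, 2), round(tipsy_r, 2))
```

The two correlations come out the same, because a bias shared by both raters shifts every score without changing how the raters line up with each other. High inter-rater reliability therefore says nothing about whether the shared lens is distorted.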
Internal Consistency
Internal consistency: The results of one measure correlate with other similar, but different, measures measuring the same thing.
Problem: Error in the measures may be correlated more than the content. It’s the correlation between the measures that is the key to knowing whether the measures are reliable, but that might be a problem:
The observer was drunk again. (GIGO)
Test-retest Reliability
Test-retest reliability: Try measuring the same thing with the same instrument more than once to see if the results are the same.
Problem: The Barney Fife problem – the person using the instrument is part of the instrument (retest won’t catch this).
– Examples: Racial differences between interviewer and subject may shift responses on surveys dealing with race. Male versus female interviewers asking about sexual issues has the same problem.
Split-Half Reliability
Split-half reliability: Split a scale into two equivalent halves (or use two equivalent forms) to see if they correlate.
Example: Use two different questions in the same survey to measure the same thing. If they are correlated, you’ve demonstrated the reliability of the instrument(s).
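A minimal sketch of a split-half check, under stated assumptions: the survey responses are hypothetical, items are split into odd and even positions, and the half-test correlation is stepped up with the standard Spearman-Brown formula (a textbook correction that the slide itself does not mention).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when totals do not vary")
    return cov / sqrt(var_x * var_y)


def split_half_reliability(responses):
    """responses: one list of item scores per respondent.

    Correlates totals on odd-position items with totals on even-position
    items, then applies the Spearman-Brown correction 2r / (1 + r).
    """
    odd_totals = [sum(row[0::2]) for row in responses]
    even_totals = [sum(row[1::2]) for row in responses]
    r_half = pearson_r(odd_totals, even_totals)
    return 2 * r_half / (1 + r_half)


# Hypothetical answers from five respondents to a four-item scale (1-5).
responses = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 1],
]
print(round(split_half_reliability(responses), 2))
```

The odd/even split is only one possible choice; a different split of the same items can give a different coefficient, which is one reason to treat the result as an estimate.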
Half Goofy: The MMPI
The Minnesota Multiphasic Personality
Inventory (1952 - )
– It’s the pattern, not the questions alone.
– Different axes (dimensions).
– The Diagnostic and Statistical Manual of Mental
Disorders (DSM)
Provides standardized diagnoses.
Describes some treatment protocols
Resources for Testing Validity and Reliability
G. David Garson, Quantitative Research in Public Administration
http://www2.chass.ncsu.edu/garson/pA765/reliab.htm
Wikipedia, Validity (Statistics)
http://en.wikipedia.org/wiki/Validity_%28statistics%29
Wikipedia, Validity (Logic)
http://en.wikipedia.org/wiki/Validity
How often have you heard:
“Scientific research proves….”
Science does not prove, it disproves.
Key things to understand:
– In science, a null hypothesis is either rejected or not rejected; it is never proven.
– The outcome of any experiment or statistical comparison counts as only one observation, regardless of the number of data points.
– Different observations at different times may yield different results.
– Eternity is not ours to observe.
Key References
David Hume (1711 – 1776) – Noted that there is nothing logically necessary about the repetition of a pattern continuing in the future.
Ludwig Wittgenstein (1889 – 1951) – Wrote the Tractatus Logico-Philosophicus, which outlines almost all of the rules of scientific endeavor. One of its most important points is that the notion of causation is a purely intellectual construction and is never a fact.
The Level of Measurement Matters
(A Logical Validity Issue)
Levels of Analysis: Examples
Individual – a person, single cell, atom, e.g. smallest discrete unit.
Group – may meet face-to-face
Organization – does not generally meet face-to-face
State – a geopolitical jurisdiction
Nation – Like Texas y’all.
Aggregate measures cannot generally be used to estimate disaggregated behavior.
Conclusions about individual-level behavior cannot be drawn from aggregate comparisons.
Example: Emile Durkheim’s Study of Suicide.
Just because more Bavarians commit suicide does not mean that Catholics are more likely to commit suicide.
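The logic of this warning can be checked with simple arithmetic. The sketch below uses entirely hypothetical counts for two regions and two generic groups (X and Y), invented for illustration and not taken from Durkheim: the region with more members of group X has the higher overall rate, yet individuals in group X have the lower rate.

```python
# Hypothetical counts, invented for illustration only.
# Each entry is (population, events) for one group within one region.
regions = {
    "Region A": {"group_x": (800, 8), "group_y": (200, 12)},
    "Region B": {"group_x": (200, 1), "group_y": (800, 9)},
}


def region_summary(groups):
    """Aggregate view: (share of group X in the region, overall event rate)."""
    population = sum(pop for pop, _ in groups.values())
    events = sum(ev for _, ev in groups.values())
    return groups["group_x"][0] / population, events / population


def individual_rate(all_regions, group):
    """Individual-level view: event rate for one group across all regions."""
    population = sum(r[group][0] for r in all_regions.values())
    events = sum(r[group][1] for r in all_regions.values())
    return events / population


for name, groups in regions.items():
    share_x, rate = region_summary(groups)
    print(f"{name}: {share_x:.0%} group X, overall rate {rate:.1%}")

print(f"group X rate: {individual_rate(regions, 'group_x'):.1%}")
print(f"group Y rate: {individual_rate(regions, 'group_y'):.1%}")
```

Reading only the regional totals suggests that group X drives the higher rate; the individual-level counts show the opposite. The aggregate comparison was never evidence about individuals in the first place.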
Disaggregated data cannot generally be used to estimate aggregate behavior.
Conclusions about aggregate behavior cannot be drawn from individual level data.
Example: Hydrogen and Oxygen burn. H2O does not.
Not ALL Texans carry guns and wear cowboy hats.
Not ALL Austinites wear speedos and ride 10-speeds downtown.
Maybe Not?
Gary King (1997). A Solution to the Ecological
Inference Problem, Princeton University Press.
Within limits, there may be “probable” statements about inferences between levels. The level of certainty about such statements can be estimated.
http://gking.harvard.edu/stats.shtml
Classic Case:
Attitudes ≠ Behavior
– LaPiere, Richard T. “Attitudes vs. Actions,” Social
Forces, Vol. 13, No. 2. (Dec., 1934), pp. 230-237.
Actual Behavior
Case #2 (1983)
Cenaré Italian Cuisine
404 East University Drive
College Station, Texas
Dr. Robert A. Peterson
Associate Dean for Research at the University of
Texas’ McCombs School of Business
Robert A. Peterson and William R. Wilson (1992). Measuring Customer
Satisfaction: Fact and Artifact, Journal of the Academy of Marketing Science, Vol.
20, No. 1, 61-71.
Customer satisfaction surveys may be measuring how many happy people or unhappy people are in the sample, nothing more.
Developing Measures
“Quantification is merely a second order matching of primary qualities.”
Karl Wolfgang Deutsch (1912-1992)
Three levels of measurement:
– Nominal – The weakest measure
– Ordinal – Mediocre, but not awful.
– Interval/Ratio – The best possible.
Nominal Measures
– Nominal (Categorical) – refers to opaque qualities, color, sex, nationality, groups, etc. Must have no order or rank.
Problem: There might be a hidden order to the measure that is not immediately identifiable, particularly in cases where social status may correlate with other measures (income, education, etc.). The existence of some hidden order is an empirical question that can be tested.
Ordinal Measures
Ordinal Measures – have direction or dimension, with greater and lesser ends to the measure.
Likert or Guttman Scales, 7-point, 5-point, but no specific distance between points. Example: Scalding, hot, warm, cool, cold, freezing, etc…
Problem: Survey question construction may prompt an order (preference among candidates).
Randomization is a partial remedy.
Interval/Ratio
Interval / Ratio Measures – Most precise kind of measures. They have a constant interval of some kind and admit of degree or gradations, sometimes referred to as a “common metric.”
Problem: Intervals may not be constant (linear). The measures may hide uneven increments. An example is education in years. A year of college is not equal to a year of elementary school
(unless you went to t.u.)
Develop Powerful Measures!
The more precise the measure, the more powerful the analytical techniques that can be used
– Nominal: Crosstabs, Chi-square, etc.
– Ordinal: Tau-b, rank order correlation, etc.
– Interval/ratio: Regression, time-series, etc.
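As an illustration of the weakest case in the list above, here is a minimal sketch of the chi-square statistic for a crosstab of two nominal variables. The table values are hypothetical, and the function returns only the statistic; comparing it against a critical value or p-value is left out.

```python
def chi_square(table):
    """Pearson chi-square statistic for a crosstab given as a list of rows.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical 2x2 crosstab: rows are two groups, columns are yes/no answers.
crosstab = [
    [30, 10],
    [10, 30],
]
print(chi_square(crosstab))
```

With nominal data the analysis can only say whether the categories are associated. Direction and size of the relationship need at least ordinal or interval measures.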
Definitions Precision
The precision of the measure depends on two critical items:
– The quality of the definition, and
– The quality of the data collection system.
Parts of a Good Definition
A clear description of the purpose
A clear description of what the measure is supposed to measure
A clear description of how the measure is to be applied, which includes:
– Every step in the data collection process
– A means for identifying error in the collection process (what the measure is not)
An explanation of how the measure will be used.
“There are no facts, only interpretations.”
Friedrich Nietzsche (1844-1900)
Context Matters
What is the theory, hypothesis, or logic model that makes this measure sensible?
Is the measurement tied to a particular problem?
Is the problem an intellectual/academic question or a practical problem requiring a solution?
What question is the measure supposed to answer?
Some call them Paradigms
Concept popularized by Thomas Kuhn in The Structure of Scientific Revolutions (1962).
The paradigm includes all the methods related to the practice of a scientific endeavor, including the instrumentation and operating assumptions.
Example: “Tell me about your mother…”
http://en.wikipedia.org/wiki/Thomas_Samuel_Kuhn
What is your context?
Why do you need to measure something?
– To test a hypothesis?
– To make decisions about agency operations?
– To calculate cost/benefits?
– To demonstrate effectiveness?
– To understand what is happening?
– To find someone to blame?
Theories that Work!
On Good Theories: On the characteristics of a good theory, see the work of Imre Lakatos, especially his book, The Methodology of Scientific Research Programmes: Philosophical Papers Volume 1 (1977); and Harry G. Frankfurt's On Truth. (See also On Bullshit.)
Good theories exemplify the characteristics of parsimony (simplicity, elegance), explanatory power (apply in a wide variety of situations), robustness (they operate in contaminated environments), and empirical support (fit facts).
Feeling Good… was good enough for me and Bobby McGee…
Kris Kristofferson
(b. 1936, Brownsville, Texas)
Flow: The Psychology of Optimal Experience by Mihaly Csikszentmihalyi
[Figure: flow diagram with axes Challenges and Skills; regions labeled Anxiety, Flow, and Boredom]
The Good Work Project
Recommended Reading: Martin E.P. Seligman, Authentic Happiness.com (Book Website). See his What You Can Change and What You Can't and The Optimistic Child; also see The Science of Optimism and Hope: Research Essays in Honor of Martin E. P. Seligman. Mihaly Csikszentmihalyi's Flow: The Psychology of Optimal Experience.
Also, see The Good Work Project website for applications of these theories.
“…it is the mark of an educated man to look for precision in each class of things just so far as the nature of the subject admits...”
- Aristotle
Nicomachean Ethics
Special Cases for Estimation
Measures that estimate ranges and compare proportions across two or more dimensions.
Measures that show relationships, trade-offs, and thresholds.
Measures that show what is not seen, residuals.
Flight Envelope Summarizes
Flight envelopes are estimated from available data which show the following characteristics:
– a Take-off speed
– b Stalling speed
– c Ceiling, with corresponding speed
– d Maximum level speed
– e Maximum speed at altitude
– f Maximum sea level speed
Two-dimensions: Flight Envelope
1. Altitude (expressed in ranges)
2. Speed (expressed in ranges)
Comparing Flight Envelopes
1. Combat helicopter (ex. Boeing AH-64 Apache)
2. Cargo aircraft (ex. Lockheed C-130J)
3. Subsonic transport aircraft (ex. Airbus A-300)
4. Supersonic fighter aircraft (ex. Lockheed F-16C)
http://www.aerodyn.org/Atm-flight/flimit.html
Measuring Inequality
The Lorenz Curve describes any distribution of a quantity across any population.
– The Gini coefficient provides a global estimation of the degree of inequality within that population.
The Gini Coefficient
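A minimal sketch of how the two ideas connect: build the Lorenz curve from sorted values, then take the Gini coefficient as one minus twice the area under that curve (trapezoid rule). The function names and the sample values are assumptions made for illustration; the inputs are assumed non-negative with a positive total.

```python
def lorenz_points(values):
    """Points (cumulative share of population, cumulative share of quantity)."""
    ordered = sorted(values)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / len(ordered), running / total))
    return points


def gini(values):
    """Gini coefficient: 1 minus twice the area under the Lorenz curve."""
    points = lorenz_points(values)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return 1 - 2 * area


# Hypothetical distributions of some quantity across four people.
print(gini([5, 5, 5, 5]))    # everyone holds the same amount
print(gini([0, 0, 0, 10]))   # one person holds everything
```

Perfect equality gives 0. With a small population, complete concentration gives (n - 1) / n rather than exactly 1, so the four-person example tops out at 0.75.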
Trade-offs
Bounded by a zero point (no trade-off).
Change in A = Change in B = 0
Trade-offs between A & B may occur six ways:
– A increases, B decreases
– B increases, A decreases
– A increases more than B
– B increases more than A
– A decreases more than B
– B decreases more than A
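The zero point and the six cases above can be written as a small classifier over the two changes. This is a sketch: the function name and labels are illustrative, and the "equal change" branch is an added assumption for the case the list does not cover (A and B moving by the same nonzero amount).

```python
def classify_tradeoff(delta_a, delta_b):
    """Label the trade-off implied by a change in A and a change in B."""
    if delta_a == 0 and delta_b == 0:
        return "no trade-off"  # the zero point
    if delta_a > 0 and delta_b < 0:
        return "A increases, B decreases"
    if delta_b > 0 and delta_a < 0:
        return "B increases, A decreases"
    if delta_a == delta_b:
        return "equal change"  # assumption: not one of the six listed cases
    if delta_a >= 0 and delta_b >= 0:
        # Net increase: whoever gains more "wins".
        if delta_a > delta_b:
            return "A increases more than B"
        return "B increases more than A"
    # Net decrease: both changes are zero or negative here.
    if delta_a < delta_b:
        return "A decreases more than B"
    return "B decreases more than A"


# Hypothetical budget changes, in percentage points.
print(classify_tradeoff(3, -1))
print(classify_tradeoff(-2, -5))
```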
Four Trade-off Conditions
Potential Trade-offs   A Wins   B Wins
Net Increase           A > B    A < B
Net Decrease           A > B    A < B
Four Basic Conditions
A Real-Life Measurement Problem
The Mississippi Department of Wildlife,
Fisheries, and Parks has an 8-week backlog in boat registration and sportsman’s licenses.
Delays do not discriminate between individuals, whether they be:
– Farmers
– Bankers
– Legislators, or
– Governors.
Measuring the Unseen
The Sherlock Holmes Approach
“We must fall back upon the old axiom that when all other contingencies fail, whatever remains, however improbable, must be the truth.”
– Sherlock Holmes
The Adventure of the Bruce-Partington Plans
(Sir Arthur Conan Doyle)
We’re on the Case!
Whatever is left…
Using residuals to measure something indirectly has been a very useful technique in several arenas.
The Most Famous Example
The Double-Helix of DNA was not observed directly. In essence, Crick and Watson used
Rosalind Franklin’s x-rays of wet and dry strands of DNA.
Essentially, they were looking at the shadow of
DNA, not the DNA itself.
Example 2: Relative Political Capacity
Initial observations:
All political systems must have resources.
Those that are able to obtain resources are stronger than those that cannot.
Wealthier populations are able to pay more taxes than poorer populations.
Some economies are easier to tax than others.
People don’t like to pay taxes, unless they know they’ll get the money back (e.g. Social Security).
Predicted/Model vs. Actual
Observations that fall on the regression line were given a score of 1.00.
Those above the line were scored as a ratio of actual to predicted: if double the predicted value, then 2.00; if three times, 3.00; and so on…
Those below their predicted tax rate were given scores from 0 to .99, based on the percentage of the predicted score.
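The scoring rule above amounts to dividing the actual value by the value the regression predicts. The sketch below assumes a single hypothetical predictor and made-up numbers; the real model behind these scores may use several predictors, so treat `fit_line` and the data as stand-ins.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def capacity_score(actual, predicted):
    """Ratio of actual to predicted: 1.00 on the line, 2.00 if double."""
    if predicted <= 0:
        raise ValueError("predicted value must be positive")
    return actual / predicted


def capacity_scores(x, y):
    """Score every observation against its own regression prediction."""
    slope, intercept = fit_line(x, y)
    return [capacity_score(actual, intercept + slope * xi)
            for xi, actual in zip(x, y)]


# Hypothetical data: a wealth index (x) and tax revenue share (y).
wealth = [1.0, 2.0, 3.0, 4.0]
tax_share = [10.0, 25.0, 28.0, 45.0]
print([round(score, 2) for score in capacity_scores(wealth, tax_share)])
```

Scores above 1.00 mark cases that extract more than the model expects for their circumstances; scores below 1.00 mark those that extract less. The measure is the residual, expressed as a ratio.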
Results: Uses of RPCs
Explains demographic transitions (population explosions or lack thereof).
Outcomes of wars between relatively equal/unequal opponents.
Black market exchange rates for currencies in unstable countries.
Real Men and Women
Use Performance Measures!
(Weenies Don’t)
Performance measures should drive performance.
– There should be thresholds at which management takes action to do something different.
Example: Watch the altimeter for sudden drops, pull up on the yoke if the numbers go down.
– Those actions should be defined in some sort of plan
Example: At 500 feet, eject.
Choosing the Right Comparisons
Myth #9: Collin County Community
College is the perfect peer.
Choose statistical neighbors (like you).
– Comparisons need to make sense.
Choose those with a similar environment.
– Environments need to be “controlled” analytically.
Choose those who differ on performance.
– Variation requires explanation and understanding.
– Lack of variation means nothing can be learned.
Choose those who out-perform the competition.
– That is the “benchmark” to beat.
Include those who do not perform well.
– This avoids the mistake of Tom Peters.
Compare environments, but choose on performance.
Baseline: Compare Trends
Track your own performance over time.
– Identify key internal and external factors.
– Test explanations (hypotheses)
Identify variations.
If there are no variations, you cannot draw any conclusions about causes. A constant explains nothing.
No Variance?
– No chance of improvement
– No Gains
– No Learning
A Costly Example of No Variance
Parties, Ideologies, and Budgets: A Study of
Budget Trade-offs in 18 OECD Countries
– Based on data from 1960 to 1990
– 65,000 cells of data drawn from more than 50 sources, taking six months to enter by hand.
Results for Health vs. Defense
Results for Education vs. Defense
All is not lost
Mona Lisa
“Discovery”
The Case of Texas State Agencies
Reporting is an operational issue.
Alabama SMART Budgeting
Training
Qualities of Good Performance
Measures in the Real World
RELIABILITY
Consistency – Data can be replicated by a competent, trained professional (e.g. Auditor).
Accuracy – The indicators are true to the facts.
VALIDITY
Relevance – Measures relate to progress toward realistic agency/organizational goals.
Usefulness – They provide actionable indicators
Data Integrity Starts with People
Checklist:
Are reporting roles clearly defined?
Is there documentation?
– A ‘paper trail’ for auditing?
– Written procedures for verifying data accuracy?
– Clear responsibility for reviewing and approving performance measure reports?
Is there management ownership for performance measurement reports?
If the answer to any of the first five questions is “No,” go back to the beginning.
Check every step from start to finish until the error or problem is identified.
If everything checks out, then it is time to look at program operations for answers.*
* This is a job that is the exclusive responsibility of program management.
Question 6: Identify Root Causes
Is the change in performance the result of an internal or external factor?
– Can the relationship between internal or external factors and performance be demonstrated with data?
Do they correlate?
What are the patterns, trends, etc.?
– What factors can be changed by management?
Can staff, training, technology or funding change the result?
What do data indicate about these connections?
Response to 6: Action Plan
What is required to change results?
– What new activities will be required to make those changes?
– What resources (or authority) would be required to implement those new activities?
– Who will implement new activities?
– When can the new activities begin?
– How long will it take for the new activities to have an effect?
Tennessee Sour Mash:
Corn and Student Test Scores
The Situation
A University of Tennessee Ag Economics
Professor proposes using crop yield formulas for measuring the “value-added” increases in student test scores.
The Tennessee General Assembly promptly enacts the idea, granting the professor a contract as the sole-source provider, naming him personally in statute (name later removed in the Tennessee Code).
Question:
How do student test scores differ from corn?
Student Test Scores Crop Yields
Photo Credit: Lloyd Wolf/U.S. Census Bureau www.freephoto.com
What Type of Measures Are They?
Nominal?
Ordinal?
Interval?
“INTERVAL”
(BOTH MEASURES)
Corn can always grow taller!
Which School Would Do Better?
Which is more likely to show a gain?
[Figure: School A and School B, each measured against a 100% ceiling; starting points of 95% and 50%]
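The arithmetic behind the question is short: a score capped at 100% leaves room for a gain of at most 100 minus the starting point. The sketch below uses the 95% and 50% starting points from the slide; the function name is illustrative.

```python
def max_possible_gain(start_pct, ceiling_pct=100.0):
    """Largest gain a capped measure can show from a given starting point."""
    if not 0 <= start_pct <= ceiling_pct:
        raise ValueError("starting point must lie between 0 and the ceiling")
    return ceiling_pct - start_pct


for start in (95.0, 50.0):
    print(f"start {start:.0f}%: at most {max_possible_gain(start):.0f} points of gain")
```

A school that starts near the ceiling cannot show much "value added" no matter how well it teaches. Corn has no such cap, which is where the crop-yield analogy breaks down.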
“Not everything that counts can be measured, and not everything that can be measured counts”
Albert Einstein (1879-1955)
Before we accept the first premise, we have to ask,
“Have we tried?”
Measuring What is Important
Organizational Culture
– Turnover – Big Clue!
– Absenteeism – Big Clue #2!
– Lack of initiative, passivity – Clue #3
– Low morale – Starting to see a pattern?
– Anger, frustration, discipline problems…
– Sense of hopelessness!!!!
How do we measure this?
Possible Index?
TEN RULES FOR STIFLING INNOVATION
1. Regard any new idea from below with suspicion – because it’s new, and because it’s from below.
2. Insist that people who need your approval to act first go through several other levels of management to get their signatures.
3. Ask departments or individuals to challenge and criticize each other’s proposals. (That saves you the job of deciding; you just pick the survivor.)
4. Discuss your criticisms freely, and withhold your praise. That keeps people on their toes. Let them know they can be fired at any time.
5. Treat identification of problems as signs of failure, to discourage people from letting you know when something in their area isn’t working.
Cont…
TEN RULES FOR STIFLING INNOVATION (continued):
6. Control everything carefully. Make sure people count anything that can be counted, frequently.
7. Make decisions to reorganize or change policies in secret, and spring them on people unexpectedly. (That also keeps people on their toes.) Let them know that they can be fired at any time.
8. Make sure that requests for information are fully justified, and make sure that it is not given out to managers freely. (You don’t want data to fall into the wrong hands.)
9. Assign to lower-level managers, in the name of delegation and participation, responsibilities for figuring out how to cut back, lay off, move people around, or otherwise implement threatening decisions you have made, and get them to do it quickly.
10. And above all, never forget that you, the higher-ups, already know everything important about this business.
“These ‘rules’ reflect pure segmentalism in action—a culture and an attitude that make it unattractive and difficult for people in the organization to take initiative to solve problems and develop innovative solutions… Segmentalist companies may not suffer from a lack of potential innovators so much as from failure to make the power available to those embryonic entrepreneurs that they can use to innovate.
And, …when innovations do occur, segmentalist organizations may not even be able to take advantage of them.”
Rosabeth Moss Kanter, The Change Masters, 1982, p. 101.
How many people did you serve?
Recidivism or Repeat Customers?
Do unduplicated counts make more sense than duplicated counts? Why?
How do we count level of service?
What if wrap-around services are effective and one-shot taps on the head are not?
What counts as service?
How do we count costs for repeat customers or those that consume more than one menu item?
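The difference between duplicated and unduplicated counts is easy to see on a small visit log. The sketch below assumes a hypothetical log with one client ID per visit; the IDs and the function name are made up for illustration.

```python
from collections import Counter


def service_counts(visit_log):
    """Summarize a visit log that holds one client ID per visit."""
    visits_per_client = Counter(visit_log)
    return {
        "duplicated": len(visit_log),            # every visit counts
        "unduplicated": len(visits_per_client),  # every person counts once
        "repeat_clients": sum(1 for n in visits_per_client.values() if n > 1),
    }


# Hypothetical log: six visits made by three different clients.
visit_log = ["c01", "c02", "c01", "c03", "c01", "c02"]
print(service_counts(visit_log))
```

The same log answers "How many people did you serve?" with either six or three. Which number is right depends on whether the program is meant to be used once or used repeatedly, and that has to be settled in the measure's definition.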
“Life is not divided into federal block grants.”
Robert Greenstein
Center for Budget and Policy Priorities
NCSL Conference in Burlington, VT
September 1995
People are strange.
Your measures need to capture reality!
All relevant observations must fit somewhere on the measure.
If they don’t, you’re missing reality.
Anomalies are as important as the “normal” observations.
We learn from measures when they help us see something we would have missed.
Outcome Measures:
Telling the Tale that Wags the Dog?
Is anybody better off?
Is anybody worse off?
How can you tell?
Adapted from Mark Friedman’s Trying Hard is Not
Good Enough.