Logical Models and Basic Numeracy in Social Sciences

http://www.psych.ut.ee/stk/Beginners_Logical_Models.pdf
Rein Taagepera © 2015
REIN TAAGEPERA, professor emeritus at University of California,
Irvine, and University of Tartu (Estonia), is the recipient of the Johan
Skytte Prize in Political Science, 2008. He has 3 research articles in
physics and over 120 in social sciences.
Table of Contents

Preface  8

A. Simple Models and Graphing

1. A Game with Serious Consequences  13
A guessing game. Skytte Prize 2008. An ignorance-based logical model. Exponents.

1a. Professionals Correct Their Own Mistakes  18
Means and median. Do not trust – try to verify by simple means. Professionals correct their own mistakes. But can this be so?

2. Pictures of Connections: How to Draw Graphs on Regular Scales  22
Example 1: Number of parties and cabinet duration. Constructing the framework. Placing data and theoretical curves on graphs. Making sense of the graph. Example 2: Linear patterns. How to measure the number of parties.

3. Science Walks on Two Legs: Observation and Thinking  29
Quantitatively predictive logical models. The invisible gorilla. Gravitation undetected. The gorilla moment for the number of seat-winning parties. What is "Basic Numeracy"?

4. The Largest Component: Between Mean and Total  36
Between mean and total. How to express absolute and relative differences. Directional and quantitative models. Connecting the connections. How long has the geometric mean been invoked in political science?

5. Forbidden and Allowed Regions: Logarithmic Scales  42
Regular scale and its snags. Logarithmic scale. When numbers multiply, their logarithms add. Graphing logarithms. Logarithms of numbers between 1 and 10 – and beyond.

6. Duration of Cabinets: The Number of Communication Channels  49
The number of communication channels among n actors. Average duration of governmental cabinets. Leap of faith: A major ingredient in model building. The basic rule of algebra: balancing. Laws and models.

7. How to Use Logarithmic Graph Paper  57
Even I can find log 2! Placing simple integer values on logarithmic scale. Fully logarithmic or log-log graph paper. Slopes of straight lines on log-log paper. Semilog graphs. Why spend so much time on the cabinet duration data? Regular, semilog and log-log graphs – when to use which?

B. Some Basic Formats

8. Think Inside the Box – The Right Box  66
Always graph the data – and more than the data! Graph the equality line, if possible. Graph the conceptually allowed area – this is the right box to think in. The simplest curve joining the anchor points. Support for Democrats in US states: Problem. Support for Democrats in US states: Solution.

9. Capitalism and Democracy in a Box  78
Support for democracy and capitalism: How can we get more out of this graph? Expanding on the democracy-capitalism box. Fitting with fixed exponent function Y=X^k. Why Y=X^k is simpler than y=a+bx? Fitting with fixed exponent function 1-Y=(1-X)^k. What is more basic: support or opposition? Logical formats and logical models. How can we know that data fits Y=X^k? A box with three anchor points: Seats and votes.

10. Science Means Connections Among Connections: Interlocking Relationships  89
Interlocking equations. Connections between constant values in relationships of similar form. Why linear fits lack connecting power. Many variables are interdependent, not "independent" or "dependent".

11. Volatility: A Partly Open Box  96
The logical part of a coarse model for volatility. Make your models as simple as possible – but no simpler. Introducing an empirical note into the coarse model. Testing the model with data. Testing the model for logical consistency.

12. How to Test Models: Logical Testing and Testing with Data  104
Logical testing. Testing with data. Many models become linear when logarithms are taken. What can we see in this graph? The tennis match between data and models. Why would the simplest forms prevail? What can we see in this graph? – A sample list.

13. Getting a Feel for Exponentials and Logarithms  113
Exponents. Fractional exponents of 10. Decimal logarithms. What are logarithms good for? Logarithms on other bases than 10.

14. When to Fit with What  118
Unbounded field – try linear fit. One quadrant allowed – try fixed exponent fit. Two quadrants allowed – try exponential fit. How to turn curves into straight lines. Calculating the parameters of fixed exponent equation in a single quadrant. Calculating the parameters of exponential equation in two quadrants. Constraints within quadrants: Two kinds of "drawn-out S" curves.

C. Interaction of Logical Models and Statistical Approaches

15. The Basics of Linear Regression and Correlation Coefficient R²  129
Regression of y on x. Reverse regression of x on y. Directionality of the two OLS lines: A tall twin's twin tends to be shorter than her twin. Non-transitivity of OLS regression. Correlation coefficient R². The R² measures lack of scatter: But scatter along which line?

16. Symmetric Regression and its Relationship to R²  142
From minimizing the sum of squares to minimizing the sum of rectangles. How R-squared connects with the slopes of regression lines. EXTRA: The mathematics of R² and the slopes of regression lines.

17. When is Linear Fit Justified?  149
Many data clouds do not resemble ellipses. Grossly different patterns can lead to the same regression lines and R². Sensitivity to outliers. Empirical configuration and logical constraints.

18. Federalism in a Box  157
Constitutional rigidity and judicial review. Degree of federalism and central bank independence. Bicameralism and degree of federalism. Conversion to scale 0 to 1.

19. The Importance of Slopes in Model Building  167
Notation for slopes. Equation for the slope of a parabola – and for y=x^k. Cube root law of assembly sizes: Minimizing communication load. Exponential growth as an ignorance-based model: Slope proportional to size. Simple logistic model: Stunted exponential. How slopes combine – evidence for the slope of y=x^k. Equations and constraints for some basic models.

D. Further Examples and Tools

20. Interest Pluralism and the Number of Parties: Exponential Fit  181
Interest group pluralism. Fitting an exponential curve to interest group pluralism data. The slope of the exponential equation. Why not fit with fixed exponent format? EXTRA 1: Why not a fit with fixed exponent? EXTRA 2: Electoral disproportionality.

21. Moderate Districts, Extreme Representatives: Competing Models  192
Graph more than the data. A model based on smooth fit of data to anchor points. A kinky model based on political polarization. Comparing the smooth and kinky models. The envelopes of the data cloud. Why are representatives more extreme than their districts?

22. Centrist Voters, Leftist Elites: Bias Within the Box  202

23. Medians and Geometric Means  207
A two-humped camel's median hump is a valley. Arithmetic mean and normal distribution. Geometric mean and lognormal distribution. The median is harder to handle than the means. The sticky case of almost lognormal distributions. How conceptual ranges, means, and forms of relationships are connected.

24. Fermi's Piano Tuners: "Exact" Science and Approximations  215
As exact as possible – and as needed. How many piano tuners? The range of possible error. Dimensional consistency.

25. Examples of Models Across Social Sciences  220
Sociology: How many journals for speakers of a language? Political history: Growth of empires. Demography: World population growth over one million years. Economics: Trade/GDP ratio.

26. Comparing Models  229
An attempt at classifying quantitatively predictive logical models and formats. Some distinctions that may create more quandaries than they help to solve. Ignorance, except for constraints. Normal distribution – full ignorance. Lognormal distribution – a single constraint. The mean of the limits – two constraints. The dual source of exponential format: Constant rate of change, and constraint on range. Simple logistic change: Zone in two quadrants. Limits on ranges of two interrelated factors: Fixed exponent format. Zone in one quadrant: Anchor point plus floor or ceiling lead to exponential change. Box in one quadrant, plus two anchor points. Box in one quadrant, plus three anchor points. Communication channels and their consequences. Population and trade models. Substantive models: Look for connections among connections.

Appendix A: What to Look for and Report in Multivariable Linear Regression  241
Making use of published multi-variable regression tables: A simple example. Guarding against collinearity. Running a multi-variable linear regression. Processing data prior to exploratory regression. Running exploratory regression. Lumping less significant variables: The need to report all medians and means. Re-running exploratory regression with fewer variables. Report the domains of all variables! Graphing the predictions of the regression equation against the actual outputs. Model-testing regression. Substantive vs. statistical significance.

Appendix B: Some Exam Questions  255

Appendix C: An Alternative Introduction to Logical Models: Basic "Graphacy" in Social Science Model Building  267

References  296
Preface
This is a hands-on book that aims to add meaning to statistical approaches, which often are the only quantitative methodology social
science students receive. The scientific method includes stages where
observation and statistical data fitting do not suffice – broader logical
thinking is required. This book offers practical examples of logical
model building, along with basic numeracy needed for this purpose. It
also connects these skills to statistical approaches. I have used drafts of
this book for many years with undergraduate, masters’, and doctoral
students at the University of California, Irvine and the University of
Tartu, Estonia. The topics presented can be largely addressed in one
semester (30 hours of instruction).
Ideally, this book should be taught to beginning college students
who still remember their secondary school mathematics. My dream is a
yearlong course that begins with half a semester using this book (up to
Chapter 16), continues with two half-semesters on statistics, and concludes with the other half of this book. The idea is to start with logical
thinking, leading to transformation of data before applying statistics,
and to conclude again with logical thinking, which adds meaning to
computer print-outs. Those students who prefer a more systematic presentation of models might read Chapter 26 first.
A previous book, Making Social Sciences More Scientific: The Need
for Predictive Models (Taagepera 2008), showed why excessive dependence on push-button statistical analysis and satisfaction with merely
directional hypotheses hamper social sciences. The intended readers
were those social scientists who have already been exposed to statistics.
They need not only more skills in logical model construction but also
some deprogramming: reducing overdependence on computer programs. Hence, Making… had its polemical parts, pinning down the
limitations of narrowly descriptive “empirical models”.
The undergraduates who have not been exposed to statistics have
nothing to unlearn. Their basic algebra is relatively fresh, even while
quite a few need help with basic numeracy. Here one can introduce the
beauty of logical models without having to fight the excesses in the use
of statistics. The present book is also more hands-on. Indeed, one may
see some merit in logical model building, yet find it hard to proceed
from philosophical acceptance to actual application. How to begin? This
is often the hardest part. One needs examples on which to practice.
Section A, “Simple Models and Graphing”, introduces simple but
powerful devices for constructing quantitatively predictive models.
Basic mathematical skills such as graphing, geometric means, and loga-
Preface
rithms have to be refreshed and upgraded from memorization to an
operational level. This is what basic numeracy means. My introduction
to logarithms may look simplistic. But I meet too many people who say
they have forgotten all they once knew about logarithms. Forgetting
logarithms is as impossible as forgetting how to ride a bike – if one ever
has truly understood them, rather than merely cramming formulas. So I
try to make the logarithms truly understood. The students are more
motivated to make the effort to understand when they know that this
time it is not math for math’s sake but something that cannot be
avoided, if we want to do meaningful social science.
Section B, “Basic Formats”, develops the crucial ability of knowing
“when to fit with what” – knowing when linear, fixed exponent or
exponential data fit would be the most promising, as a first step. The
fixed exponent format is stressed, with sufficient repetition so that
students may advance beyond the point of using it merely upon
command, like trained dogs. The goal is to build up sufficient understanding and confidence so that students would recognize opportunities
to use their skills in further studies of social phenomena. I introduce
examples from published social science literature, by scholars whom I
highly respect. Even in the best that social sciences offer, modeling and
data analysis can at times be expanded on, for maximal payoff. I thank
the authors of these examples for their kind permission to make use of
their work.
Section C, “Interaction of Logical Models and Statistical Approaches”, makes the connection to statistical approaches, including
symmetric regression, so as to develop ability to use them in proper
conjunction with model building. These topics demand some further
mathematics, including the use of “slopes”, my less threatening way to
slip in elements of differential equations. Some aspects of Sections B
and C have been published separately as "Adding Meaning to Regression" (Taagepera 2010).
Section D presents “Further Examples and Tools”. In particular, it
introduces exponential functions somewhat more systematically. However, my experience is that ability truly to use exponentials requires
more than a few weeks of instruction.
Appendix A offers advice about improved multivariable regression.
This tends to be lost on students who have not reached this topic in
statistics. Appendix B presents some exam questions.
As one reads the chapter titles, no logical sequence may seem to
emerge. Why isn’t there one section called Basic Numeracy, followed
by sections on different types of formats and models? Trouble is that a
sequence that makes sense logically can be inefficient pedagogically.
Cramming all basic numeracy into the beginning is a recipe for
overwhelming students with methodology, the usefulness of which as
yet escapes them. Instead, I introduce bits of math only when
one can no longer proceed without them. This approach increases
student motivation and allows the new skills to sink in before something
else begs for attention.
I put much emphasis on graphical representation, both of data and
also more than the data (meaning logical constraints). Social science
papers all too often present data (or merely statistical analysis without
data!) without any attempt at graphing – or presenting only bar graphs,
those half-brothers of data tables. Yet, in all too many cases, the
moment y is graphed against x, configurations jump to the eye that tell
us that standard statistical approaches would be pointless unless the data
are thoughtfully transformed. Ill-applied statistical analysis does not
advance our knowledge of the social world. Using graphs, I gradually
reinforce students’ ability to use formats such as exponential and fixed
exponent – and the ability to recognize when to use such formats! The
more I teach, the more I become aware of small but essential tricks of
the trade that I have taken for granted but are not self-evident to
students and so have to be taught explicitly.
Logical model construction makes use of usual mathematical tools,
but it requires true understanding of those tools, beyond automatic
application of formulas or push-button computation. This is why I stress
the “primitive” stages in logarithms etc., which involve thinking, and
which students often overlook or forget in the rush towards ever more
complex equations. The emphasis is on basic understanding.
Exercises are spiked throughout the text. Asterisks indicate those I
have often assigned as homework to be turned in and corrected, in 9
batches: Exercises 2.1, 2.3 / 4.2, 5.2 / 6.3, 7.1, 7.2 / 7.4, 8.1 / 9.3, 9.4 /
10.1, 10.2 / 14.1, 16.2 / 17.2, 18.1, 18.2 / 19.2, 22.1. I do have a set of
brief answers for a part of the exercises, but I am reluctant to distribute
them electronically.
I have been asked about possible follow-up courses. Almost any of
the usual courses on quantitative methods would profit from the basic
skills developed here. A special logical models seminar could be based
on Making Social Sciences More Scientific (Taagepera 2008),
encouraging students to apply model construction to topics of their own
choice. This has been successful with doctoral students at the University
of Tartu.
Many students at the University of Tartu and at the University of
California, Irvine have contributed to the present draft, by their
questions and also by their mistakes when solving the problems presented in exercises. These mistakes showed me where my wordings
were imprecise or more intermediary steps were needed. Many colleagues also have helped, wittingly and unwittingly. Rein Murakas in particular has contributed to formatting. The long list of people acknowledged in the Preface of Making… could well be repeated and expanded
on. Special thanks go to colleagues who graciously agreed to have their
graphs subjected to further analysis: Arend Lijphart, Russell J. Dalton
along with Doh Chull Shin, and Richard Johnston along with Michael
G. Hagen and Kathleen H. Jamieson. I still have to ask Pippa Norris and
Ronald Inglehart for permission to include one of their graphs in an
exercise.
Rein Taagepera
rtaagepe@uci.edu
A. Simple Models and Graphing
1. A Game with Serious Consequences

• In the absence of any further information, the mean of the conceptual extreme values is often our best guess.
• The geometric mean is often preferable to the arithmetic mean. For the geometric mean of two numbers, multiply them together and take the square root.
• "In the absence of any further information" is a term we will encounter often in this book. This is the key for building parsimonious logical models. We may call it the "ignorance-based approach".
What are “logical models”? What is “numeracy”? Let us start with a
game – a game of guessing. Suppose a representative assembly has one
hundred seats, and they are allocated nationwide. Suppose some proportional representation rule is used. This means that even a party with
1% of the votes is assured a seat. The question is: How many parties would
you expect to win seats, on the average?
A guessing game
Think about it for a moment. I repeat:
How many parties would you expect to win at least one seat, out of the
one hundred seats in the assembly?
Would you guess at 2 parties, 5, 10, 20, or 50 parties?
You may protest that you cannot possibly know how many parties get
seats, because I have not given you any information on how the votes
are distributed among parties. If this is the way you think, you are in
good company. For decades, I was stuck at this point.
Now suppose I told you two hundred parties would get seats. Would
you buy that? Most likely, you wouldn’t. You’d protest that this could
not be so when only one hundred seats are available. Fair enough, so
what is the upper limit that is still logically possible? It’s one hundred.
This is not likely to happen, but in principle, 100 parties could win one
seat each.
This is the upper limit. Let’s now consider the opposite extreme.
What is the lower limit? It’s 1. This is not likely to happen either, but in
principle, one party could win all 100 seats.
So you did have some information, after all – you knew the lower
and upper limits, beyond which the answer cannot be on logical
grounds. At this point make a guess in the range 1 to 100 and write it
down. Call this number n:
n = the number of parties.
Now, let us proceed systematically. When such conceptual limits are
given, our best guess would be halfway between the limits. In the
absence of any further information, nothing else but the mean of the
limits could be justified. However, there are many kinds of means. The
good old arithmetic mean of 1 and 100 would be around 50: (1+100)/2 = 50.5. But having 50 parties getting seats would mean that on the
average they would win only two seats each. Only two seats per party?
This might look rather low. If so, then let’s ask which number would
not look too low. We now have a new question:
How many seats would we expect the average party to win when 100
seats are available?
How would we proceed now? Yes, we should again think about
conceptual limits. The average number of seats per party must be at
least 1 (when every party wins only one seat) and at most 100 (when
one party wins all the seats). At this point, make a guess in the range 1
to 100 and write it down. Call this number s:
s = mean number of seats per party.
Now, if we really think that n parties win an average of s seats each,
then the total number of seats (T) must be the product n times s:
T=ns.
Here ns must be 100. Calculate this product for the two guesses we have
written down. If this product isn’t 100, our two guesses do not fit and
must be adjusted. In particular, arithmetic means do not fit. Indeed, 50
parties winning 50 seats each would require 2500 seats – way above the
100 we started with! This approach clearly does not work.
If the product of your two guesses came out as 100 seats, congratulations – your guesses are mutually consistent. But these guesses might
still be less than optimal. Suppose someone guessed at 5 parties winning
an average of 20 seats each, while someone else guessed at 20 parties
winning an average of 5 seats each. What would be the justification for
assuming that there are more parties than seats per party – or vice versa?
In the absence of any further information on which way the tilt goes,
the neutral assumption is that the two are equal. This means 10 parties
winning an average of 10 seats each.
This is what we call the geometric mean. The geometric mean of
two numbers is the square root of their product. This means that the
square of the geometric mean equals the product of these two numbers.
In the present case, 10 times 10 is the same as 1 times 100.
We’ll see later on (Chapter 23) why we should use the geometric
mean (rather than the arithmetic) whenever we deal with quantities that
logically cannot go negative. This is certainly the case for numbers of
parties or seats.
Do we have data to test the guess that 10 parties might win seats?
Yes, from 1918 to 1952 The Netherlands did have a first chamber of
100 seats, allocated on the basis of nationwide vote shares. Over these 9
elections the number of seat-winning parties ranged widely, from 8 up
to as many as 17. But the geometric mean was 10.3 parties, with an
average of 9.7 seats per party. This is pretty close to 10 parties with an
average of 10 seats. As you see, we could make a prediction with much
less information than you may have thought necessary. And this
approach actually worked!
This is an example of an ignorance-based logical model. It is based
on what may look like nearly complete ignorance. All we knew were
the conceptual limits 1 and 100. Mind you, we were lucky that The
Netherlands fitted so well. But suppose the actual average had been 8
parties. By guessing 10, we would still have been within 20%. This is
much better than lamely saying "I don't know" and giving up.
Skytte Prize 2008
Why have I dwelled so long on this simple guessing game? In 2008, I
received the Skytte Prize, one of the highest in political science. And I
basically received it for this guessing game! Oh well, lots of further
work and useful results followed, but the breakthrough moment did
come around 1990, when I was puzzled about the number of seat-winning parties and suddenly told myself: Consider the mean of the
extremes.
Using this approach twice enabled me to calculate the number of
parties in the entire representative assembly when a country allocates
assembly seats in many districts. All I needed was assembly size and the
number of seats allocated in the average electoral district.
In turn, the number of parties could be used to determine the average
duration of governmental cabinets. (This model will be tested with data
in the next chapter.) Here the logical model is quite different from the
previous – we’ll come to that (Chapter 6). The overall effect is that we
can design for a desired cabinet duration by manipulating the assembly
size and the number of seats allocated in the average district. This is of
practical use, even while the range of error is as yet quite large.
An ignorance-based logical model
We can generalize beyond the limits 1 and 100. This is what science is
about: making broad connections. If something can conceptually range
only from 1 to T, then our best guess is the geometric mean of 1 and T,
which is the square root of T. This square root can also be written as T^(1/2)
or T^0.5. Thus, the number of seat-winning parties can be expressed as an
equation:
n = T^(1/2).
Exponents
Can you see why the square root of a number T can be written as T^(1/2)? The
square root of T is a number such that its square is T. Now try to square T^(1/2).
We get (T^(1/2))^2, which is T^1 = T. If your grasp of exponents (like 1/2 in T^(1/2)) is
shaky, look up Chapter 13. We will have to use exponents over and over.
Even more broadly, when the conceptual limits are minimum m and
maximum M, both positive, then our best guess is the geometric mean
of m and M:
Best guess between limits m and M: g = (mM)^(1/2).
This is an ignorance-based logical model. It is based on what may
look like nearly complete ignorance. It asks what we can infer “in the
absence of any further information.” This is a term we will encounter
often in this book. This is the key for building logical models that are
“parsimonious”. This means they include as few inputs as possible.
What if you don't know how to calculate this mysterious (mM)^(1/2)?
Calculate the product mM, and then use the three bears’ approach in
Exercise 1.1 below.
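If you do happen to have a computer at hand, the same ignorance-based guess takes one line. The following is a minimal Python sketch of my own (the function name best_guess is not from the book); it is a convenience, not a replacement for working the guess out by hand or with the three bears.

```python
import math

def best_guess(m, M):
    """Ignorance-based best guess between conceptual limits m and M (both positive):
    their geometric mean."""
    return math.sqrt(m * M)

print(best_guess(1, 100))   # 10.0 -- seat-winning parties when 100 seats are allocated nationwide
```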
These exercises are interspersed within the text. They may at times
look irrelevant to the topic on hand, yet the basic approach is similar.
This way, they illustrate the wide scope of the particular method.
Bypass these exercises at your own risk – the risk of thinking that you
have understood the text while actually you have not. Do not just read
them; do them! Without such understanding the following chapters may
become increasingly obscure. The exercises marked with * are especially important.
Exercise 1.1
The three bears tried to figure out the square root of 2013. Papa Bear
offered 100. But 100 times 100 is 10,000 – that’s too much. Mama Bear
offered 10. But 10×10 = 100 – that's much too little. Little Bear then
proposed 50. Well, 50×50 = 2,500 is getting closer.
a) Without using the square root button of a pocket calculator,
help the bears to determine this square root within ±1. This
means finding a number n such that n2 is less than 2013,
while (n+1)2 is already larger than 2013.
b) What do you think is the broad purpose of this exercise?
Write down your guess, and then compare with suggestions
toward the end of this chapter.
Exercise 1.2
As I was completing secondary school in Marrakech, in 1953, the first
Moroccan uprisings against the French “protectorate” took place in
Casablanca. The French-controlled newspapers wrote that 40 people
were killed. Our Moroccan servant, however, reported rumors that
several thousand were. Take this to mean around 4,000. My friend
Jacques, with family ties in high military circles, asked me to guess how
many people actually were killed, according to classified army reports.
a) Which estimate did I offer, in the absence of any further
information? Write down your reasoned guess.
b) What do you think is the broad purpose of this exercise?
Exercise 1.3
There is an animal they call lmysh in Marrakech. What is your best
guess at roughly how much a lmysh weighs? The only information is
that lmysh is a mammal. The smallest mammal is a shrew (about 3
grams), and the largest is a blue whale (30 metric tons).
a) Convert the weights of shrews and blue whales from grams
and tons into kilograms. 1 ton=1000 kg. 1 gram= 1/1000 kg.
b) Estimate the weight of lmysh.
c) What do you think is the broad purpose of this exercise?
Each of these exercises also asks you about the broad purpose of this
exercise. The broadest purpose is to give you a feeling of empowerment. Yes, you can figure out many things approximately, and
“approximate” is often sufficient. Dare to muddle through!
If your secondary school math included the exact procedure for
calculating the square root, then you might have received a bad grade if
you used the three bears’ approach, even if you got pretty close to the
correct answer. What is the long-term outcome of the school approach?
Probably: “I have forgotten how to calculate the square root of a
number.” But with the three bears’ approach, can you ever forget it?
The purpose of Exercise 1.1 is not merely to calculate the square root
of a number. It is to encourage relaxed muddling through, more broadly.
And to encourage you to ask, whatever your further employment: Why
is the boss asking me to do this?
It is advisable to ask this question about every exercise in this book,
even while its text does not pose this question: What is the broad
purpose of this exercise?
1a. Professionals Correct Their Own Mistakes

• Never believe your answers without asking: "Does it make sense?" Do not offer an answer you don't believe in, without hedging.
• The arithmetic mean adds; the geometric mean multiplies. Use the geometric mean when all components are positive and some are much larger than some others.
In school, we offer answers to instructors who presumably already
know the correct answer. In later life, this rarely is so. Our boss asks us
to find out about something precisely because she does not know. If our
answer is wrong, there is no one to correct us. But decisions will be
made, partly based on our work, and there will be consequences. Too
many wrong answers will catch up with us.
Means and median
First, some basics about the means need to be reviewed. In our guessing
game we had to use various means for two numbers. We’ll need them
later on, too – and for more than two numbers. So let us clarify some
terms. As an example, take the numbers 2, 3, 5, 10 and 19.
The median is the point where half the values are smaller and half
are larger. For 2, 3, 5, 10 and 19, it is 5. (If we had 2, 3, 5, 6, 10 and 19,
the median would be (5+6)/2=5.5.) This is what we are after, most
often. But the median is awkward to handle (see Chapter 23), so instead,
we try to use various means.
The arithmetic mean of n values adds the values, then divides by n.
For the 5 numbers above, it is (2+3+5+10+19)/5=39/5=7.8. In other
words, 7.8+7.8+7.8+7.8+7.8 and 2+3+5+10+19 are equal.
The geometric mean, in contrast, multiplies the n values and then
takes the nth root – which is the same as exponent or “power” 1/n. For
the 5 numbers above, (2×3×5×10×19)^(1/5) = 5,700^(1/5) ≈ 5.64. How did I get
that? On a typical pocket calculator, calculate the product 2×3×5×10×19,
then push the key "y^x", enter "5", then push "1/x" and "=". In
other words, if 5.64 is the geometric mean of 2, 3, 5, 10 and 19, then
5.64×5.64×5.64×5.64×5.64 and 2×3×5×10×19 are equal.
When do we use which mean? Broadly speaking, use the geometric
mean when all components are positive and some are much larger than
some others (like 1, 10 and 100). Why? Because in such cases the
median is closer to the geometric mean. Otherwise, use the arithmetic
mean. In the example above, the arithmetic mean (7.8) is higher than the
geometric mean (5.64). Indeed, the arithmetic mean is never smaller
than the geometric and can be much larger.
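For readers who would rather verify these numbers with a few lines of code than with a pocket calculator, here is a short Python sketch of my own (the variable names are arbitrary); it reproduces the median, arithmetic mean, and geometric mean of the example above.

```python
import math
from statistics import median

values = [2, 3, 5, 10, 19]
n = len(values)

arith = sum(values) / n                       # (2+3+5+10+19)/5 = 7.8
geom = math.prod(values) ** (1 / n)           # (2*3*5*10*19)**(1/5) = 5700**(1/5), about 5.64
print(median(values), arith, round(geom, 2))  # 5  7.8  5.64
```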
It was previously said that, over 9 elections in The Netherlands, the
number of seat-winning parties ranged from 8 to 17, but their geometric
mean was 10.3 parties. Now it becomes clearer what this meant. The
arithmetic mean would be somewhat higher than the geometric.
Do not trust – try to verify by simple means
This business of "5,700^(1/5)" in the calculations above may well be new to
you. The instructions on how to calculate it on a pocket calculator
enable you to carry it out, but they do not help you in understanding
what it means. Try to find ways to check whether the outcome makes
sense.
Multiply 5.64 five times by itself and see if you get back 5,700. Grab
your pocket calculator and do it right now! You actually get 5,707 – this
is close enough.
What if your pocket calculator does not have a "y^x" key? You can
always use the three bears' approach (Exercise 1.1). If 5×5×5×5×5 =
3,125 is a bit too little and 6×6×6×6×6 = 7,776 is much too much,
compared to 5,700, then try next something between 5 and 6. For most
purposes, you don’t have to be “exact”. Even 5.64 might be more than
you need – 5.6 might well suffice.
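The same trial-and-improve spirit carries over directly to code. The sketch below is my own (assuming a Python interpreter is available, which the book does not require): it first closes the loop on 5.64 and then creeps up on the fifth root of 5,700 in steps of 0.1, three-bears style.

```python
# Close the loop: does 5.64 really have a fifth power near 5,700?
print(5.64 ** 5)            # about 5707 -- close enough

# Three bears' approach: raise a trial value until its fifth power would overshoot 5,700
x = 5.0
while (x + 0.1) ** 5 <= 5700:
    x += 0.1
print(round(x, 1))          # 5.6 -- often all the precision we need
```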
Professionals Correct Their Own Mistakes
“We correct our own mistakes. This is what tells us apart from the lab
assistants," Jack Ballou, my first boss at the DuPont Company
Pioneering Lab, told me. "See our helper Bill. He's real good, one of the best lab
assistants we have. But when he makes mistakes in his calculations, he
accepts the results. Professionals also make mistakes, but they sense
when something is wrong. So it’s part of our job to catch our mistakes
as well as those of people like Bill.”
How do they do it? They never take a numerical result for granted.
They always ask: Does this answer make sense? If it does not, a red
warning light should start blinking in our minds: Perhaps we have made
a mistake. Never believe your answers without checking!
After completing a more complex calculation, repeat it with approximate numbers. Suppose we find that the division 8.4÷2.1 yields 6.3.
Does it make sense? Redo it approximately: 8÷2 is 4. In comparison,
6.3 looks much too large, so something is fishy. Sure enough: Instead of
the division button on the calculator we have pushed the subtraction
button. This happens.
Sometimes closing the loop is possible. Take again the division
8.4÷2.1. If our answer comes out as 6.3, do mentally the approximate
inverse operation – the multiplication: 6.3×2.1 ≈ 6×2=12. (This “≈”
means “approximately equal”.) This 12 is far away from 8.4, so again
6.3 looks suspect.
At other times, the magnitude just does not make sense. Suppose we
calculate the “effective number of parties” (whatever that means) in a
country and get 53. How often do we see a country with 50 parties?
Maybe we have misplaced a decimal, and it should be 5.3 instead?
But can this be so?
Every professional develops her/his own small tricks to check on
calculations – and also on qualitative errors of reasoning! The common
thread is that they draw in some comparison with something else and
then ask: “But can this be so?”
This is what my mother asked me when I was 4. We played at
multiplying by 3: twice 2, 3 times 3, and so on. Suddenly, she jumped
way ahead and asked: “How much is 11 times 3?” This was way out of
my range. I blurted out 28. She did not correct me. She did not tell me
the right answer – and this is the crux of the story. Instead, she asked:
"How much is 10 times 3?" This was easy. I already knew I only had to
add a zero: 30. And then she asked the all-important question: “But how
can 11 times 3 be smaller than 10 times 3?” This is when I became a
scientist. I discovered the power of comparing things and asking “But
can this be so?”
Never believe your answers without asking: “Does it make sense?”
Never offer your professor or your boss an answer you don’t believe in,
without hedging. If you cannot find a mistake, but the answer looks
suspect for whatever reason, tell her so. Imagine your boss making a
decision based on your erroneous number. She may lose money, and
you may lose your job.
This does not mean that common sense is always the supreme
arbiter. Many physics laws look counterintuitive, at first. Newton's
First Law says that a body not subject to a force continues to move
forever at constant velocity. OK, push a pencil across the table, and then
stop pushing. The pencil soon stops – no movement at constant speed!
One has to invoke a “frictional force” stopping the movement, and at
first it may look like a pretty lame way out, to save Newton’s skin. Only
gradually, as you consider and test various situations, does it become
evident that it really makes sense to consider friction a force – a passive
one.
In social science, too, some valid results may look counterintuitive.
Don’t reject everything that does not seem to make sense, but ask the
question and double check for possible mistakes.
2. Pictures of Connections: How to Draw Graphs on Regular Scales

• Most of what follows in this book involves some graphing. The deeper our understanding of graphing is, the more we will be able to understand model construction and testing.
• The "effective number" of components such as seat shares of parties is N = 1/Σ(p_i²), where p_i is the fractional share of the i-th component and the symbol Σ (sigma) stands for SUM.
Science is largely about making connections between things that can
vary, so that, when knowing the value of one variable, one can deduce
the value of the other. While logical models most often take the shape
of equations (such as p = M^(1/2)), it is often easier to visualize them as
graphs. Hence, most of what follows in this book involves some
graphing. This is why graphing is introduced here early on, so that
possible mistakes can be corrected as soon as possible. The deeper your
understanding of graphing is, the more you will be able to understand
model construction and testing.
We'll go slowly. Indeed, we'll go so slowly that you may feel like
bypassing the first half of this chapter, thinking that you already know
all that. Don’t bypass it. While reading, try to tell apart things you
know, tidbits that are new and may come handy, and – things you have
been doing while not quite knowing why. (I myself found some of those
while writing this chapter!)
Construct the graphs by hand – NOT by computer. This way you
learn more about the problem at hand, and you do not have to fight the
peculiarities of computer programs. Computer-drawn graphs have their
proper place, but you have to do enough graphing by hand before you
can become the master of computer programs rather than slave to their
quirks and rigidities.
Example 1: Number of parties and cabinet duration
Suppose you are given some data for the effective number of parties in
the assembly (N) and the duration of government cabinets (C) – Table
2.1. (The “effective number” of parties is defined at the end of the
chapter.) What can we tell about the relationship between C and N when
looking at this table? Pretty little. We can see much more when showing
them as a picture – a graph, that is. This is the task in Exercise 2.1.
Make sure you heed the advice in the sections below (Constructing the
framework, Placing data and theoretical curves, and Making sense of
the graph).
* Exercise 2.1
a) Graph data in Table 2.1, C vs. N.
b) The number of parties cannot be less than 1. So draw a vertical line
at N=1 and mark its left side by shading or slanted lines – this is a
“conceptually forbidden area”.
c) Draw the trend curve, meaning a smooth curve passing in between
the data points.
d) Superimposed on these data points and trend curve (not separately!),
also graph the logical model C = 42 years/N² (which will be presented soon).
e) Compare data, trend curve and model. Does this model express the
general trend of the data points?
f) Also graph the curves C = 21 years/N² and C = 84 years/N² (again
superimposed, not separately!). Describe their locations compared to
that of the curve C = 42 years/N² and the data points, with special attention
to Greece and Botswana.
g) Try to draw some conclusions: What is the purpose of this exercise,
besides practicing basic graphing?
Table 2.1. Number of parties and cabinet duration. This is a representative sample of 35 countries tested.

Country    N     C (yrs.)
Botswana   1.35  40
Bahamas    1.7   14.9
Greece     2.2   4.9
Colombia   3.3   4.7
Finland    5.0   1.3
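The exercise asks for graphs drawn by hand, but it does no harm to check the arithmetic of the model C = 42 years/N² by machine first. The following Python sketch is my own illustration (the dictionary layout is arbitrary), not part of the exercise.

```python
# Table 2.1 data: effective number of parties N and observed cabinet duration C (years)
data = {"Botswana": (1.35, 40.0), "Bahamas": (1.7, 14.9), "Greece": (2.2, 4.9),
        "Colombia": (3.3, 4.7), "Finland": (5.0, 1.3)}

for country, (N, C_observed) in data.items():
    C_model = 42 / N ** 2                      # the logical model C = 42 years / N**2
    print(f"{country:10s} observed {C_observed:5.1f}  model {C_model:5.1f}")
```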
Exercise 2.2
a) Continuing the previous exercise, calculate the arithmetic mean
for the N values of the 5 countries, and similarly for their C
values. Enter this point (mean N, mean C) on the graph. Where
is it located relative to the curve C=42 years/N2? How come
that the individual countries fit the model better than their
mean? (You may not be able to answer this question, but give it
a try.) zu.sbaoah, ohatlet.zuz
b) Now calculate the geometric means for N and C of these
countries and enter this mean point on the graph. Where is it
located relative to the curve C = 42/N²? Compare the locations of
arithmetic and geometric means. zu.arbazuz, sba.zutsõud
Constructing the framework
Try to be precise – otherwise important conclusions may be missed.
Precision includes the following.
A! Get some square paper (or special graph paper with small squares
within larger ones). Make the graphs at least the size of a regular half-page. Why? Tiny freehand sketches on blank paper are not precise
enough for drawing useful conclusions. (The graphs you see in books
are mostly reduced from larger originals.) Until they receive a bad
grade, one-third of students tend to overlook this requirement.
B! Graph one variable on the x-axis and another on the y-axis. Do NOT use
the so-called bar graph format. Why not? Bar graphs hardly convey
more information than a table. They do not show how one variable
relates to another.
C! By convention, “C vs. N” means that C is on the vertical “y-axis”
and N is on the horizontal “x-axis”. Do not reverse them. Why not? This
is like the convention to drive on the right side of the road – it avoids
confusion. About one-tenth of students tend to overlook this rule, at
first.
D! Inspect the data table to see how much space you need. In Table 2.1,
the y scale must accommodate values from 0 to close to 50 years, to fit
Botswana and Finland. The x scale must accommodate values from 0 to
5, but better include the range 0 to 6, to be on the safe side.
E! On both scales, mark locations at equal distances. For C in Exercise
2.1, one could indicate 0, 10, … 40, 50. For N, 0, 1, … 5, 6 impose
themselves. DO NOT laboriously write in 0, 1, 2, … 50 on the y scale.
Why not? Such crowding blurs the picture.
F! Make sure that intervals between these main divisions include 2, 5 or
10 squares of the squared paper – then you can easily place the data
points precisely. Do NOT use intervals of 3 or 7 squares. Why not? Just
try to place 1.6 precisely, on a scale with 7 squares between “1” and “2”!
G! DO include the point (0,0) in the graph, if at all possible, and DO
label it on both axes with “0” just as you do for “1” or “2”. Zeros are
numbers too, and you hurt their feelings when you omit them. (Indeed,
arithmetic really got going only when the Hindus invented a symbol for
“nothing”, something that eluded the “practical” Romans.)
H! NEVER use unequal interval lengths for equal distances, just
because some intervals have data points and others don’t. (One does
encounter published population graphs where the dates 1950, 1980,
2000, 2005, 2006 and 2007 are shown at equal intervals. Population
growth seems to slow down, even when it actually does not. This is
lying with graphs, intentional or not.)
I! Label the axes. In the present case show “Number of parties (N)”
along the x axis, and “Cabinet Duration, years (C)” along the y axis. Do
not forget “years” – all durations would be 12 times larger, if months
were used!
J! When drawing straight lines (such as y and x axes), use a ruler – do
not do it freehand. You don’t have a ruler handy? Of course you do. Use
the edge of a book or writing block.
Placing data points and theoretical curves on graphs
K! Indicate data points by small dots at the precise location, complemented by a larger mark around it – an “O” or an “X”, etc. Do NOT use
just a small dot, which looks like an accidental speck, or a blob the
center of which is fuzzy.
L! Do NOT connect the data points with a zigzag line. Instead, show
the general trend, a smooth curve or straight line passing in the middle
of the cloud of data points. Why? There is “noise” – random divergence
of data points from the general trend. What interests us in the first place
is this general trend. One-third of the students tend to play such a
game of “connect-the-dots”. I understand – I remember how I did so
in my first high school physics lab.
M! For a theoretical curve such as C = 42 years/N², calculate C for
simple values of N such as 1, 2, 4, 6. Mark them on the graph with tiny
symbols, quite different from those for data points. Draw a smooth
curve through them. Do NOT make life hard for yourself by calculating
C at N=1.35 or 1.7 just because Botswana and Bahamas happen to have
these values. But you might need to calculate C at N=1.5 because the
curve drops steeply between N=1 and N=2.
N! Before drawing the curve, turn your graph on the side so that the y
scale is on top. Your natural wrist movement readily allows you to join
the theoretical points smoothly.
Making sense of the graph
We have constructed the graph. Now comes the main part: Making
sense of it. In the present case, the main question is: Does the cloud of
data points agree with the theoretical curve (the model)? When can we
say that this is the case?
First, there should be a roughly equal number of points above and
below the curve. If most data points lie above the theoretical curve, the
model cannot be said to fit. But even more is needed. There should be a
roughly equal number of points above and below the curve at both
extremes of the curve. If all data points were too high at one end and too
low at the other, then some other curve would fit better than the curve
proposed.
This is coarse advice, but fairly foolproof.
In the actual case in Exercise 2.1, some points are above and some
below the theoretical curve – so the model may well hold, with some
random scatter. But a mere 5 data points usually do not suffice to test a
model.
Example 2: Linear patterns
Now consider the data in Table 2.2, which applies to the previous 5
countries. What x and y stand for will be explained later on. What can
we tell about the relationship between x and y?
Table 2.2. Some data on 5 countries.

Country    x      y
Botswana   0.130  1.602
Bahamas    0.230  1.173
Greece     0.342  0.690
Colombia   0.519  0.672
Finland    0.699  0.114
* Exercise 2.3
a) Graph y vs. x for countries in Table 2.2. Use the same scale on
both axes (i.e., same length for distance from 0 to 1). Make sure
your axes cross at (0,0).
b) Comment on the shape of the resulting pattern.
c) Use a ruler, preferably a transparent one, so you can see all the
data points. By eye, draw the best-fit line through these points.
This means that there should be about the same number of
points above and below the line. Moreover, such balance should
hold both at low and at high ends of the data cloud.
d) Any straight line has the form y=a+bx. Determine the constants
a and b for this particular line. How is this done? See Figure
2.1. Write out the equation specific to this particular straight
line; this means plugging in the numerical values of a and b into
y=…+…x. oah.ot, egytmenhet
e) According to a logical model, the line y=1.62-2x is expected.
Graph it superimposed on the previous (not separately!). To do
so, plug 3 simple values of x into y=1.62-2x – like x=0, 0.5, and
1 – and calculate the resulting y. If the 3 points do not fall
perfectly on a line, you’ll know a mistake has been made.
f) How close is your visually drawn straight line fit to the model?
In particular, how close is your best-fit slope to -2, the slope
predicted by the model?
Figure 2.1. For the line y=a+bx, “intercept” a is the left side of the
triangle formed by this line and the point (0,0). The slope b is the ratio
of lengths of the left and bottom sides. Since the line in this figure
slopes downward, the slope must carry a negative sign.
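To make the slope-and-intercept reading concrete, here is a small Python sketch of my own. The two points are hypothetical readings off a fitted line; they happen to lie on the model line y = 1.62 − 2x mentioned in Exercise 2.3e, not values you are required to obtain.

```python
# Two points read off a fitted straight line (hypothetical readings)
x1, y1 = 0.0, 1.62
x2, y2 = 0.5, 0.62

b = (y2 - y1) / (x2 - x1)   # slope: rise over run (negative for a line sloping downward)
a = y1 - b * x1             # intercept: the value of y at x = 0
print(a, b)                 # 1.62 -2.0, matching y = 1.62 - 2x
```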
How to measure the number of parties
We usually face a mix of larger and smaller parties, whether we consider
their vote shares or seat shares. How do we measure the “real” number of
parties when some are large and some are negligible? Most often the
following “effective number of components” is used:
N = 1/Σ(p_i²).
Here p_i is the fractional share of the i-th component and the symbol Σ
(sigma) stands for SUM. What does this mean? Suppose the seat shares of 4
parties are 40, 30, 20 and 10, for a total of 100. The "fractional shares" are
those numbers divided by the total. Then N = 1/(0.40² + 0.30² + 0.20² + 0.10²) =
1/0.30 ≈ 3.3. There are fewer than 4 but more than 3 serious parties, sort of.
This “Laakso-Taagepera effective number” is never larger than the total
number of parties (which here is 4). The values of N shown in this chapter
are effective numbers of assembly parties, i.e., numbers based on seats. For
other purposes, one might use vote shares.
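For readers who prefer to check such calculations by machine, a minimal Python version of the effective-number formula might look as follows; the function name is my own, and raw seat counts are accepted because they are first converted to fractional shares.

```python
def effective_number(shares):
    """Laakso-Taagepera effective number: N = 1 / sum of squared fractional shares."""
    total = sum(shares)
    fractions = [s / total for s in shares]      # raw seat counts become fractional shares
    return 1 / sum(f * f for f in fractions)

print(round(effective_number([40, 30, 20, 10]), 1))   # 3.3, as in the example above
```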
Exercise 2.4
a) Guess at the effective number when the seat shares of parties are 45,
35, 10, 9 and 1. Write it down. Then calculate it. Compare the result
to the guess.
b) Do the same when the numbers of seats for three parties are 100, 80
and 20, respectively. (CAUTION: To get the fractional shares, one
first has to divide by the total number of seats.)
3. Science Walks on Two Legs: Observation and Thinking

• In addition to asking how things are, we must also ask how they should be, on logical grounds. Science walks on these two legs.
• We largely see only what we look for.
• Models based on (almost complete) ignorance are only one of many different types of logical models.
• Make the models as simple as possible.
• Quantitative models have more predictive power than directional models.
• Knowing how to do something is of little use when one does not know when the time has come to make use of one's skills.
This book is about social science – the approach used in science,
applied to things social. Science is more than just learning facts. A
person who has memorized the encyclopedia and quotes from it is not a
scientist. Science deals with making connections among separate pieces
of knowledge. Making connections among known facts can lead to new
questions and new, previously unexpected vistas. Connections can be
expressed in words, but they are more precise when they can be
expressed in equations.
Four blind men tried to figure out what an elephant was, by touching
it. The one who happened to grope around a leg concluded the elephant was
a tree trunk. Oh well, you know the story… except my way to end it.
Only three of them began to argue with each other, insisting on their
own individual truths. The fourth person thought: These are reasonable
people who know what they have encountered. Let me try to construct a
picture that connects all of their observations. What he got was nothing
like an elephant. It was a tree with a hose hanging from it, and so on.
This description fitted the known facts, but it did not make much sense.
So he kept on thinking. Finally, some detail in the description of the
elephant’s trunk made him hit on a broad idea, a logical model far
beyond the details described: This must be a live animal! Then
everything began to fall in place. Well, almost everything. This person
was a scientist: He tried to connect facts to each other – and then to
connect the various connections.
Science walks on two legs – see Figure 3.1. One leg refers to the
question: How things are? It leads to careful observation, description,
measurement, and statistical analysis. The other leg refers to the question:
How things should be, on logical grounds? That question guides the first
one. The question “How things are?” assumes that we know which
aspects of things are worth paying attention to. But we largely see only
what we look for. And it’s the question “How things should be?” that
tells us what to look for. This is the question we asked about the number
of seat-winning parties – even while it may not look so.
That science walks on two legs is a notion as old as social science.
Auguste Comte, one of the initiators of social studies, put it as follows,
two centuries ago, in his Plan of Scientific Studies Necessary for
Reorganization of Society:
If it is true that every theory must be based upon observed facts,
it is equally true that facts cannot be observed without the
guidance of some theory. Without such guidance, our facts
would be desultory and fruitless; we could not retain them: for
the most part we could not even perceive them. (As quoted in
Stein 2008: 30–31)
We largely see only what we look for. In this quote the logical model
("theory") may seem to come first, but actually a continuous interaction
is meant: “some theory” as guidance, some observation, some further
model refinement… The chicken and the egg evolve conjointly.
Figure 3.1. Science walks on two legs: Observation and Thinking. [Diagram: the left leg, "How things ARE", rises from Observation through Measurement, Data, and data analysis (statistical etc.) to Empirical relationships; the right leg, "How things SHOULD BE on logical grounds", rises from Thinking through Directional prediction to Quantitatively predictive logical models; the two legs meet at the top, under the heading SCIENCE, in statistical testing of quantitatively predictive logical models.]
Quantitatively predictive logical models
What are "logical models"? Now that we have an example in developing p = M^(1/2), the following answer may make sense to you. A logical
model is a model that one can construct without any data input – or
almost so. Just consider how things should be or, as importantly, how
they cannot possibly be. In the previous case, we had no data, just the
knowledge that the answer could not be less than 1 or more than 100.
Furthermore, we should aim at “quantitatively predictive logical
models”. This means that such logical models enable us to predict in a
quantitative way. In the previous model the prediction was not just a
vague “between 1 and 100” but a more specific “around 10”. “Quantitative” does not have to mean “exact” – just much more specific than
“between 1 and 100”.
Logical models should not be believed in. They must be tested. In
an earlier chapter, we compared it with some data on elections in The
Netherlands. This is not enough. This was merely an illustration,
showing that the model might work. Much more testing was needed to
make the model credible.
How does one go on to build logical models? One learns it best by
doing, because it’s an art. Each situation requires a different model. The
preceding chapters used a model of (almost complete) ignorance, but this
is only one of many approaches. If there is one piece of general advice, it is:
Make it as simple as possible. (Albert Einstein reputedly added “and
no simpler”, but let’s leave that for later.) For Stephen Hawking (2010),
a good model has the following features:
• It is elegant. (This means simplicity, symmetry…)
• It has few arbitrary or adjustable components – it is parsimonious.
• It generalizes, explaining all available observations.
• It enables one to make detailed predictions; the actual data may possibly contradict these predictions and thus lead to a revision of the model.
Some of the greatest truths in life and science are very simple. Indeed,
they are so simple that we may overlook them. And even when pointed
out to us, we may refuse to accept them, because we say: “It cannot be
that simple.” (Maybe you had this reaction to p = M^(1/2).) Sometimes it can
be simple. This does not mean that it is simple to find simple truths.
Moreover, combining simple building blocks can lead to quite complex
constructions. So, when we think we have a model, simple or complex,
we should verify whether it really holds. This is what testing a model
means, in science.
The invisible gorilla
We largely see only what we look for. A gorilla taught me so. At a
science education conference in York we were shown a film clip.
People were playing basketball, and we were instructed to count the
number of passes. The action was too fast for me. I soon gave up
counting and just watched idly, waiting for the end. Thereafter we were
asked if we had noticed anything special. A few laughed knowingly and
shouted: “The gorilla!”
Oh my! I vaguely recalled reading something about a gorilla experiment. Was I among the ones taken in? I surely was. The clip was run
again. While the game went on in the background, a person in a gorilla
suit slowly walked across in the foreground. Center stage it stopped,
turned and looked at us. Then it continued at a slow pace and exited. It
was as plain as anything could be, once we were given a hint – but
without such a hint most of us had not seen him!
In science, the word “should” – as in “How things should be?” – is
often the gorilla word. We see only what we think we should look for. If
science were reduced to supposedly hard-boiled observation and
analysis of facts but nothing else, we might improve our ability to count
the passes while still missing the gorilla. The following test illustrates it.
Gravitation undetected
I sent some 35 social scientists data where the output was calculated
exactly from the formula for the universal law of gravitation – but I
didn’t tell them. The law is F = GMm/r^2 – the force of attraction F between
two bodies is proportional to their masses (M and m) and inversely
proportional to the square of their distance (r). G is a universal constant.
I simply sent my colleagues a table of values of y, x1, x2, and x3, and told
them that y might depend on the other variables. Where did I get the x-values? I used a telephone book to pick essentially random values. Then
y came from y = 980·x1·x3/x2^2.
What was the purpose of this experiment? If data analysis sufficed to
detect how things are connected, some of my colleagues should have
found the actual form of the relationship. All of them were highly
competent in data analysis. Social data usually comes with large random
variation, which makes detection of regularities so much more difficult.
My pseudo-data had no such scatter. Yet no one found the form of the
relationship. They tried out standard formulas used by statisticians,
usually of the type y = a + b·x1 − c·x2 + d·x3. This linear expression uses
addition or subtraction, while the actual equation involves multiplication and division.
If only they had found no connection! But it was worse than
that. All those who responded found quite satisfactory results by the
usual criteria of data analysis. Why was it worse than finding nothing?
If we do not get a satisfactory result, we may keep on working. But if
we get a result that looks satisfactory, yet is off the mark, then we stop
struggling to find anything better. I cannot blame my colleagues for not
detecting the law of gravity – I only gave them plenty of “what is” but
no clue about “what should be”. Small wonder they missed it. However,
the “should” part is underdeveloped and underestimated in today’s
social sciences.
While science in general walks on two legs, today’s social sciences
show a tendency to hop on one leg (Figure 3.2). “How things should be
on logical grounds” tends to be reduced to directional models, such as
“If the number of seats available increases, then so does the number of
parties represented”. In shorter notation: “M up → p up.” How much
up? The directional model does not predict it. But with a little thinking,
we could propose p = M^(1/2), which not only includes “M up → p up” but
also offers a specific quantitative prediction. Quantitative models
have more predictive power than directional models.
The validity of a logical model must be tested – this is where
statistics comes in. But without a logical model there is nothing to test.
We would just be measuring.
Figure 3.2. Today’s social science tends to hop on one leg, Observation.
[Diagram: the Observation leg – Measurement → Data → statistical data analysis → empirical relationships (how things ARE) – is fully developed, while the Thinking leg (how things SHOULD BE on logical grounds) stops at directional prediction; only the directional prediction gets statistically tested.]
The gorilla moment for the number of seat-winning parties
Of course, the distinction between “is” and “should” isn’t always so
clear-cut. Most often they enter intermixed. Consider how the number
of seat-winning parties eluded me (and everyone else) for decades.
The obvious part was that this number depends on how people vote –
but pinning down typical vote patterns was even harder than counting
the number of passes in a basketball game. In hindsight, this difficulty
applied only to districts with many seats. In one-seat districts obviously
one and only one party would win the single seat available, regardless
of how many parties competed and how people voted. This observation
looked trivial and not even worth stating – and this was the problem. It
was “obviously” irrelevant to the puzzle in multi-seat districts.
But suppose someone had spelled out the following: “In a one-seat
district, the number of seat-winning parties is one, regardless of how
voters vote.” It would have been like someone shouting: “The gorilla!”
Indeed, if the way voters vote is completely overridden by something
else in one-seat districts, could it be the same in multi-seat districts, at
least partially? We are directed to see that the total number of seats
available should matter, at least on the average. The rest was easy.
When did the shift take place, from fact-oriented “is” toward
“should”? It’s hard to say. The observation that a district of 100 seats
offers room for more parties than a district of 10 seats or a district of 1
seat is sort of factual. Yet it brings in a fact that previously was thought
irrelevant. It supplied the springboard for the first sentence above
where the word “should” explicitly enters: “The total number of seats
available should matter.” Moreover, don’t overlook the expression “on
the average”. Dwarfs and giants may occur, but first try to pin down the
usual.
What Is “Basic Numeracy”?
Literacy means ability to read and write – and to understand what one reads.
The term “numeracy” has been coined to mean the ability to handle
numbers. Does one have some sense of the distance between New York and
Los Angeles or Paris and Berlin? Can one say whether a per capita Gross
Domestic Product of 10,000 dollars or euros feels large or small? Can one
carry out simple mathematical operations when ordered to do so? And if so,
does one have a feeling for what the calculated result means or implies for
some substantive purpose? For this purpose, does one know when to carry
out simple mathematical operations, without someone else giving an order to
do so? The latter is critical. Knowing how to do something is of little use
when one does not know when the time has come to make use of one’s skills.
We have taken steps in this direction. “Does this answer make sense?” is
a basic criterion for numeracy. “But this is what my computer program
prints out” is a poor substitute. Figuring out approximate answers to
problems is another aspect of basic numeracy. Desperately trying to figure
out which ready-made formula to pick and plug in is a poor substitute.
Almost everyone profits from basic numeracy. Ability to make one’s
personal financial decisions calls for arithmetic operations, operations with
percentages, and a general sense of sizes and distances. What is part of
further basic numeracy varies, when it comes to specific occupations.
This book will not try to define what kind of numeracy should be needed
in social sciences as such. Its prime purpose is to develop skills in
constructing logical models. So it introduces only whatever aspects of
numeracy are needed, for this purpose. It does so at the point when they are
needed and only to the degree they are needed. Introducing the basics of
means and the median in the first chapter is an example – it was needed right
then. Only one aspect of numeracy is so important for logical model
building that it was introduced before the need became obvious: constructing graphs y vs. x. While needed for model building, most of these
skills are also needed for social sciences more broadly.
4. The Largest Component: Between Mean and
Total
• The share of the largest component is often close to the total size divided by the square root of the number of components. This is a quantitative model.
• In contrast, “The largest share tends to go down when the number of components increases” is a directional model.
• We should try to go beyond directional models, because quantitative models have vastly more predictive power.
• To express closeness of estimates, use relative differences rather than absolute differences. When relative differences are large, it is more meaningful to express relative error by multiplicative factors, such as ×÷2 (“multiply or divide by 2”), rather than percent differences, such as ±50%.
The United States has about 300 million people (312 million in 2011)
divided among 50 states. (We make it simple and ignore DC and Puerto
Rico.) What could be the population of the largest state? Write down
your gut-level guess. Call it P1.
The US also has an area of 10 million square kilometers (9.88, if one
wishes to be overly precise). What could be the area of the largest state?
Write down your gut-level guess. Call it A1.
In Chapter 1 we found that in a district with 100 seats we can expect
about 10 parties to be represented. How many seats might go to the
largest of these parties? Write down your gut-level guess. Call it S1.
Between mean and total
Once again, our first reaction might be that one cannot know, short of
reaching for an almanac. But think about the conceptual limits. Obviously, the largest state cannot exceed the total. But what is the least size it
could have? Stop and think.
Here the number of states enters. If all states had the same population and area, each would have 300/50=6 million people and 10/50=
0.2 million square kilometers. If they are unequal, the largest state must
have more than that. So we know that
6 million < P1 < 300 million, and
0.2 million km² < A1 < 10 million km².
(Note that the largest state by area need not be the largest by population!)
4. The Largest Component: Between Mean and Total
In the absence of any other information, our best estimate is the mean
of the extremes. This is another ignorance-based logical model. Since
population and area cannot go negative and vary hugely, the geometric
mean applies. The result is P1 = 42 million and A1 = 1.4 million km².
Actually, the most populous state (California, 38 million in 2011) was
smaller than we guessed, while the largest by area (Alaska, 1.7 million
km²) was larger. Yet both estimates are within 25% of the actual
figures. We may also say that each is within a factor of 1.25 – meaning
“multiply or divide the actual value by 1.25”. Now compare your gut-level guesses with the figures above. Did you get as close as the
geometric mean of mean size and the total?
Exercise 4.1
Note that we made use of both means. First, 6 million was the
arithmetic mean size of the 50 states. Then we took the geometric mean
of mean size and the total. Why? OK, try to do it the other way round.
To obtain 50 equal-sized states, we have to divide the total by 50,
meaning the arithmetic mean. There is no other way. But we can take
the arithmetic means of 6 and 300. Do it and see what population it
would predict for California. Do the same for area. Roughly how large
would the largest component be, compared to the total? How much
would it leave to all the other states combined? Would it agree with your
gut feeling?
We can repeat the estimates for Canada or Australia, and the same error
limits hold. This approach fails, however, when the federal units are
purposefully made fairly equal (such as in Austria) or a large state has
later added small satellites (Prussia within Imperial Germany, Russia
within the Soviet Union). On the other hand, the approach applies much
more widely than to sizes of federal units. In particular, it applies to parties.
If 100 seats are distributed among 10 parties, their mean share is 10
seats. The largest party can have as few as 10 and as many as (almost)
100 seats. Hence our best guess is (100×10)^(1/2) = 31.6 seats. In the
aforementioned case of The Netherlands 1918-1952, the actual largest
share over 9 elections ranged from 28 to 32, with an average of 30.6.
Compare this to your gut-level guess at the start of the chapter.
The fit isn’t always that good, but the estimate works as the average
over many countries. Indeed, this relationship was a major link in my
series of models leading to prediction of cabinet duration. It may be
useful for some other sociopolitical problems.
Up to now, we have worked out individual cases. It’s time to
establish the general formula. When a total size T is divided among n
components, then the largest component (S1) can range from T/n to T.
The best estimate for S1 is the geometric mean of these extremes: S1 =
[(T/n)·T]^(1/2) = [T^2/n]^(1/2) = T/n^(1/2). In short:
S1 = T/n^(1/2).
The largest component is often close to the total size divided by the
square root of the number of components. The relative share of the
largest component, out of the total, is s1 = S1/T. Then we get an even
simpler relationship, and we can also express it in percent:
s1 = 1/n^(1/2) = 100%/n^(1/2).
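As a quick numerical check, a short sketch applying S1 = T/n^(1/2) to the figures used in this chapter (the numbers are the rounded totals quoted above, nothing more):

```python
def largest_component(total, n):
    """Ignorance-based estimate: the geometric mean of the extremes T/n and T."""
    return total / n ** 0.5

# US population (millions) and area (millions of km^2) split among 50 states
print(round(largest_component(300, 50), 1))   # about 42 (California: 38 million)
print(round(largest_component(10, 50), 2))    # about 1.41 (Alaska: 1.7 million km^2)

# 100 seats split among 10 parties
print(round(largest_component(100, 10), 1))   # about 31.6 seats
```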
* Exercise 4.2
Australia had a population of 22.5 million in 2011. How large would
you expect the population of its largest federal component to be? Your
guess would depend on the number of components, but there is a snag.
Australia has 6 states plus a large and empty Northern Territory with
limited self-rule and also a separate Capital Territory (Canberra).
Should we base our estimate on 6 or 8 components?
a) Estimate the largest component both for n=6 and n=8.
b) The actual figure for the most populous component (New
South Wales) was 7.2 million in 2011. By how much are your
two estimates off? (See next section, before answering.)
c) Draw some general conclusions. (Students often too quickly
conclude that what works better in one particular case is
preferable in general. Keep random variation in mind. The US
case does not imply that S1 = T/n^(1/2) always overestimates
populations but underestimates areas of largest subunits!)
How to express absolute and relative differences
Suppose we overestimate the weight of an animal by 2 kilograms (about
5 lbs.). The absolute difference is 2 kilograms. Is this little or much? It
depends. For an elephant, it would be a remarkably good estimate. For a
mouse, it would be unbelievably off.
So we are better off using the relative difference. When relative
differences are small, it is convenient to express them in percent. When
our estimate is 500 while the actual number is 525, then the percent
difference is [(525−500)/525]×100% ≈ 5%. But when the differences are
large, then we run into trouble.
Suppose your boss tells you that business is bad and your salary
must be cut by 50%. “But don’t worry. When business gets better, I’ll
increase your salary by 50%.” If you buy that, you are in for permanent
salary reduction. Indeed, suppose your initial salary was 1000 euros. A
50% cut will take it to 500 euros. Increase 500 by 50%, and you end up
with 750.
Percent differences lack symmetry: They cannot be lower than
-100%, but they can be very much higher than +100%. Hence a percent
error of ±5% is fairly clear, but an error of ±50% becomes ambiguous.
Use percent difference or error only when it is much less than 50%.
When relative differences are large, it is more meaningful to express
them by multiplicative factors, such as ×÷2 (“multiply or divide by 2”).
Being low by a factor of 2 is the same as being low by 50%. But being
high by a factor of 2 is the same as being high by 100%. If your boss
cuts your salary by a factor of two and later increases it by a factor of
two, then you do break even. This is a marked advantage.
By analogy with the widely used notation “±”, I’ll use “×÷”
(“multiply or divide”) for differences “by a factor of”. But be aware that
this is not a widely practiced notation – most people will not recognize
it, unless you explain it.
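The asymmetry of percent changes, and the symmetry of multiplicative factors, can be checked directly. A minimal sketch of the salary example above:

```python
salary = 1000.0                          # euros

# Percent changes are asymmetric: -50% followed by +50% loses money
after_cut      = salary * (1 - 0.50)     # 500
after_increase = after_cut * (1 + 0.50)  # 750, not 1000

# Factor changes are symmetric: divide by 2, then multiply by 2
after_cut_f      = salary / 2            # 500
after_increase_f = after_cut_f * 2       # 1000, back where we started

print(after_increase, after_increase_f)  # 750.0 1000.0
```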
Directional and quantitative models
This difference was pointed out in the previous chapter. Let us apply it
here. A logical model can be merely directional, such as “If the number
of components goes up, the share of the largest component goes down.”
But how fast does it go down? We get much more out of a quantitative
model such as S1 = T/n^(1/2). It tells us everything the directional model
does, but also much more. For instance, if n increases 4-fold, then S1 is
reduced by one-half. Compare this to the directional statement: “If n
increases 4-fold, then S1 is reduced”. By how much? It does not tell us.
We should always aim at going beyond a directional model, because a
quantitative model has vastly more predictive power.
Some people think that every research project must start with a
“hypothesis”. For the directional model, this would be the same as the
model: “If the number of components goes up, the share of the largest
component goes down.” But what could be the starting “hypothesis”
when we aim at a quantitative model? We do not know its shape ahead
of time. So we can only offer the stilted hypothesis: “There is some
quantitative relationship between the number of components and the
share of the largest component – and if I try, I can find it.” It is less
stilted to replace the “research hypothesis” by the research question:
“What is the quantitative relationship between the number of components and the share of the largest component?”
Connecting the connections
Science deals with making connections among facts, but it aims at more
than that: It tries to make connections among those connections. What
does this mean?
The model S1 = T/n^(1/2) predicts that with 100 seats and 10 parties, the
largest party share would be around 100/10^(1/2) = 31.6. In Chapter 1 we
established the model n = T^(1/2) for parties. These two models interconnect.
Indeed, we can “plug” n = T^(1/2) into S1 = T/n^(1/2). This means replacing n by
its equivalent, T^(1/2):
S1 = T/n^(1/2) = T/(T^(1/2))^(1/2) = T/T^(1/4) = T^(3/4).
(If you feel uneasy about this T/T^(1/4) = T^(3/4), take a look at Exponents in
Chapter 13.) In short, when T seats are allocated by proportional
representation, the largest party can be expected to win T^(3/4) seats:
S1 = T^(3/4).
Let us apply it to T=100: 100^(3/4) = 31.6 – the same result we obtained
when going by two separate stages.
This is what “making connection between connections” means.
Symbolically, we have more than two isolated connections, T→n and (T
and n)→S1 – in the case of parties, they connect into T→S1. We have
interlocking models. This is what makes science powerful.
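The interlocking can also be verified numerically: both the two-stage route and the combined S1 = T^(3/4) give the same answer. A minimal sketch for T = 100:

```python
T = 100                       # seats available in the district

n = T ** 0.5                  # expected number of seat-winning parties
s1_two_stage = T / n ** 0.5   # largest party, via the two separate models
s1_combined  = T ** 0.75      # largest party, via the combined model S1 = T^(3/4)

print(round(s1_two_stage, 1), round(s1_combined, 1))   # 31.6 31.6
```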
How long has the geometric mean been invoked in political science?
At least for 250 years. Jean-Jacques Rousseau (1762) considered the government as an intermediary power between the people as the source of law (“the
sovereign”) and the people as individuals, subject to law (“the people”):
This last relation can be depicted as one between the first and last
terms of a geometric progression, of which the geometric mean is
the government. The government receives from the sovereign the
orders which it gives to the people; and if the state is to be well
balanced, it is necessary, all things being weighed, that the product
of the power of the government multiplied by itself should equal
the product of the power of the citizens who are sovereign in one
sense and subjects in another. (Rousseau, The Social Contract,
Book III, Chapter I “On Government in General” 1762/1968)
Thereafter, Rousseau plays with the idea of the power of the people as a
whole corresponding to their number, say 10,000 citizens, while an
individual is one. But then he veers off into generalities, shying away from
stating that the geometric mean is 100. He probably sensed that this “100”
would raise questions he could not answer. Would it mean having a
governmental payroll of 100 people, for such a small republic?
Exercise 4.3
Let us continue to play with the idea Rousseau fleetingly suggests. How
large a “government” would the square root of population suggest for
New South Wales, Australia, California, or the United States? How do
those numbers compare with the actual number of employees in the
public sector, at all levels?
The broader question raised here is the following: Do larger units
have a larger percentage of the work force in the public sector
(directional model), and if so, what is the relationship to the population
(quantitative model)?
5. Forbidden and Allowed Regions: Logarithmic
Scales
• Some values for a quantity may be forbidden on conceptual grounds, or on grounds of factual knowledge. What remains is the allowed region.
• The center of the allowed region is often easier to visualize when showing the regions on a logarithmic scale.
• The logarithmic scale shows 0.1, 1, 10 and 100 at equal intervals.
• The (decimal) logarithm of a number such as 10,000 – “1” followed by zeros – is simply the number of zeros.
• When numbers multiply, their logarithms add: log(ab) = log a + log b.
• It follows that log(1/b) = −log b, and log(a^n) = n·log a.
In all the previous examples, we could represent our options on an axis
ranging from minus to plus infinity (-∞ to +∞). How did we narrow
down the options, say, for the weight of lmysh in Exercise 1.3? Let us
proceed by systematic elimination, as shown in Figure 5.1.
Regular scale and its snags
Negative weights are inconceivable, even in fairy tales. Therefore, we
should first mark off the negative part of the scale as an utterly forbidden
region for any weights whatsoever. Thereafter, we could mark the
weights of shrews and blue whales and exclude what’s outside this range
as forbidden regions for mammals. This would leave the allowed region,
between Shrew and Whale. In Figure 5.1, the forbidden regions are
dashed off, so that the allowed region stands out. The dashing is heavier
for utterly forbidden regions, to visualize a stronger degree of exclusion.
Figure 5.1. Weights of mammals on an almost even scale: Allowed
region.
All that’s left is to make the best estimate within this region. This is still
quite a job, but without pinning down the allowed region, we could not
even start. Within the allowed region we could still distinguish a central
region, where we are most likely to expect to find most mammals, from
cats to horses, and marginal regions of surprise, where we would be
surprised to find lmysh, if we know that few mammals are as small as
shrews or as large as whales.
The scheme above is not drawn to an even scale. If we try to redraw
it truly to scale (Figure 5.2), we run into double trouble. First, compared to the weight of the whale, the weight of the shrew is so tiny that
it fuses with 0, and the forbidden zone between them vanishes. Second,
the “central region” of little surprise would not be in the visual center
between Zero/Shrew and Whale. This visual center represents the
arithmetic mean of shrew and the largest whale, and it would
correspond to a median-sized whale! Most mammals range around the
geometric mean of Shrew and Whale, but in this picture this range
would be so close to Shrew that it, too, would be indistinguishable from
zero. (In Figure 5.1, I had to expand this part of the graph to make it
visible.) To get a more informative picture we must shift to a different
scale.
Figure 5.2. Weights of mammals on a strictly even scale.
Logarithmic scale
What is the lo-gar-ithm of a number? It’s less scary than it looks. For
numbers like 0.01, 0.1, 1, 1000 and 1,000,000, their logarithm (to base
10) is simply the number of zeros. Count the zeros on the
right of “1” as positive. Thus log1000=3 and log1,000,000=6. Count the
zeros on the left of “1” as negative. Thus log0.01=-2 and log0.1=-1.
What about 1? It has no zeros, so log1=0. Now graph numbers
according to the distances between their logarithms. This is done in
Figure 5.3.
Figure 5.3. Integer exponents of 10 and their logarithms.
In other words, logarithmic scale is a scale where you add equal
distances as you multiply by equal numbers. On this scale, the distances
from 1 to 10, from 10 to 100, and from 100 to 1000 are equal, because
each time you multiply by 10. In the reverse direction, the same
distance takes you from 1 to 0.1 and from 0.1 to 0.01, because each time
you divide by 10. This is how we get the minus values for logarithms of
0.001 etc.
Logarithmic scale goes by the “order of magnitude” of numbers,
which roughly means the number of zeros they have. Note in Figure 5.3
that 10,000 = 10^4 and log 10,000 = 4. Also 0.01 = 10^(−2) and log 0.01 = −2. Do
you notice the correspondence between exponents and logarithms? We
have 10,000 = 10^(log 10,000), and 0.01 = 10^(log 0.01)! This is so for all numbers:
n = 10^(log n).
Let us show the weights of mammals on such a logarithmic scale
(Figure 5.4). When using kilograms, the shrew is at 0.003; it logically
must be between 0.001 and 0.01. The blue whale is at 30,000; it
logically must be between 10,000 and 100,000. Now the visual center of
the allowed region is around the geometric mean of shrew and blue
whale, which is 10 kg – an average dog. The surprise regions evenly
flank the geometric mean.
Figure 5.4. Weights of mammals on a logarithmic scale: Allowed
region
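The “dog-sized” center of the allowed region is just the geometric mean of the two extremes; a one-line check, using the rounded weights quoted above:

```python
shrew, whale = 0.003, 30_000           # rounded weights in kg, as used above
center = (shrew * whale) ** 0.5        # geometric mean of the two extremes
print(round(center, 1))                # about 9.5 kg -- roughly an average dog
```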
But what happens to the utterly forbidden region? It vanishes because,
on the logarithmic scale, the zero point shifts to the left,
infinitely far. Indeed, we may add as many zeros as we wish behind the
decimal point, 0.000000…0001, and we are still higher than zero. The
logarithmic scale has no zero point – it is infinitely far away. Objects
cannot have zero or negative weights, and logarithmic scale cannot
show zero or negative numbers. The two are well suited for each other.
Exercise 5.1
a) Make a list of coins and banknotes used in your country.
Mark their locations on a logarithmic scale. (Place “2” at
one-third of the distance from 1 to 10, “3” at one-half, and
“5” at two-thirds.)
b) Comment on the distances between these locations.
c) Now suppose I asked you to mark their locations on a
regular scale. In your response, please stay polite. “Go fly a
kite” is acceptable.
d) Comment on the statement “We do not encounter the
logarithmic scale in everyday life".
* Exercise 5.2
a) Show the forbidden and allowed areas for the number of
seat-winning parties in a 100-seat district, on logarithmic
scale. Also show the geometric mean. On the log scale, 2 is
about one-third of the way between 1 and 10; 3 is halfway,
and 5 is two-thirds toward 10. Do not bother about
distinguishing between different types of forbidden ranges,
nor between surprise and central ranges.
b) Do the same for Casablanca deaths in Exercise 1.2.
c) Do the same for the most populated US state. [NOTE: We
do not have to start from 1 person! Start with a million.]
d) Do the same for the largest US state. [We do not have to
start from 1 km2!]
e) Try to do the same on regular scale, for parts b to d, if you
feel it’s simpler. Draw some conclusions.
When numbers multiply, their logarithms add
We said that logarithmic scale is a scale where you add equal distances
as you multiply by equal numbers. From this, the following results (for
proof see Chapter 13):
log(ab) = log a + log b.
For instance, log(100×1000)=log100+log1000. Yes, 5 zeros equal (2+3)
zeros. As another example, take 2×5. This is the same as 1×10. Hence
the distance log2+log5 in Figure 5.3 must be the same as the distance
log1+log10, which is 0+1=1. So, whatever the values of log2 and log5,
they must add up to 1.
How much is log(1/a)? Look at 10 and 1/10 in Figure 5.3. Their
logarithms are 1 and -1. This is general:
log(1/a) = -loga.
It follows that
log(a/b) = log a - log b.
What about the logarithm of 5^3? log(5^3) = log(5×5×5) = log5 + log5 +
log5 = 3·log5. We can generalize: When a number is raised to the power n, its logarithm is multiplied by n. Thus,
log(a^n) = n·log a.
If you merely memorize these formulas, you risk getting confused. It’s
safer to make an effort to understand why they must be true. We are
now in a position to calculate logarithms in C = 42/N^2, in Exercise 2.1:
log C = log(42/N^2) = log 42 + log(1/N^2) = log 42 − log(N^2) = log 42 − 2·log N.
Note that this means a linear relationship between log C and log N. It’s
like y = a + bx, where a is log 42, and the slope b is exactly −2.
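These rules are easy to verify with a calculator or a few lines of code. The sketch below checks that logarithms add when numbers multiply, that powers become multipliers, and that C = 42/N^2 indeed turns into a straight line in the logs:

```python
import math

log = math.log10

# When numbers multiply, their logarithms add
print(round(log(2 * 5), 3), round(log(2) + log(5), 3))      # 1.0 1.0
# When a number is raised to power n, its logarithm is multiplied by n
print(round(log(5 ** 3), 3), round(3 * log(5), 3))          # 2.097 2.097

# C = 42/N**2 becomes the straight line log C = log 42 - 2 log N
for N in (1.35, 2.2, 5.0):
    print(round(log(42 / N ** 2), 3), round(log(42) - 2 * log(N), 3))
```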
Graphing logarithms
In the graph you constructed for Exercise 2.1, the point for Botswana is
twice as high as the next one, forcing us to extend the C scale. Also the
visual trend curve and the curve for model C = 42 years/N^2 are bent,
which makes it hard to compare data with respect to these curves. In
Exercise 2.3, on the other hand, the data points fall roughly on a straight
line, and the logical model also is a straight line. At this moment, I did
not tell you what these variables x and y represented. Now it can be told:
x = log N and y = log C. In y = 1.62 − 2x, 1.62 is log 42, and the slope −2 comes
from the exponent of N. The two tables in Chapter 2 combine to Table
5.1.
Table 5.1. Number of parties and cabinet duration, and their logarithms.
Country     N      C (yrs.)   x = logN   y = logC
Botswana    1.35   40         0.130      1.602
Bahamas     1.7    14.9       0.230      1.173
Greece      2.2     4.9       0.342      0.690
Colombia    3.3     4.7       0.519      0.672
Finland     5.0     1.3       0.699      0.114
Exercise 5.3.
Check the values of logarithms in Table 5.1, roughly, comparing them
to distances in Figure 5.3.
a) For values of N the logarithms must be more than log1=0 but
appreciably less than log10=1. Is this so?
b) For values of C, which logarithms must be between 0 and 1, and
which must be between 1 and 2? Are they so?
The remarkable thing is that taking logarithms changes the curve C = 42
years/N^2 into a straight line, y = 1.62 − 2x, and a straight line is much more
manageable than a curve. When can logarithms turn a curve into a
straight line? We’ll come to that.
Logarithms of numbers between 1 and 10 – and beyond
Table 5.2 shows the logarithms for numbers from 1 to 10. Why such
values? We’ll come to that. For the moment, accept them, for
information. Note that, pretty closely, log2=0.3, log 3=0.5, log4=0.6,
and log5=0.7. Often, this is close enough.
Table 5.2. Logarithms of numbers from 1 to 10 … and from 10 to 100.
x:      1    1.5    2     3     4     5     6     7     8     9     10
log x:  0    .176   .301  .477  .602  .699  .778  .845  .903  .954  1

x:      10   15     20    30    40    50    60    70    80    90    100
log x:  1    1.176  1.301 1.477 1.602 1.699 1.778 1.845 1.903 1.954 2
What about log300? log300 = log(100×3) = log100 + log3 = 2 + 0.477 = 2.477.
Look at Table 5.2: log30 is exactly 1 added to log3. So when we know
the logarithm of 3, we also know it for 30, 300, and 3000 – just add the
number of zeros!
Recall that 10,000 = 10^4 and log 10,000 = 4. So 10,000 = 10^(log 10,000). This
is so for all numbers:
n = 10^(log n).
Try it out on your pocket calculator. Take 10 to power 1.301, and see if
you get 20. (On a usual calculator, enter 10, push “x^y”, enter 1.301, push
“=”. Or simply enter 1.301 and push “2nd” “log”.)
6. The Number of Communication Channels and
the Duration of Cabinets
• The number of communication channels increases roughly as the square of the number of actors.
• The inverse square law of cabinet duration is one of the consequences.
• A law, in the scientific sense, combines empirical regularity and explanation through a logical model.
There is more to logical model building than the ignorance-based
approach, which was used for the number of parties and for the size of
the largest component. Now we take on a problem where a very
different approach is needed. Start with an issue close to home. One of
the tasks of parents is to adjudicate squabbles among their children. The
more children, the more squabbles. This would be a directional model:
when x up, then y up. But how fast does the frequency of squabbles
increase as the number of children increases? Let us establish a
quantitative model.
The number of communication channels among n actors
With no children or with just one child (A), no conflicts among children
can arise. With two children (A and B), they can. When a third child (C)
is added, conflict frequency triples, because in addition to conflict
channel AB there are also AC and BC. My wife and I felt this tripling the
moment our third child began to walk. What would have happened to
the number of conflict channels if we had a fourth child? Write down
your answer.
It’s the same with parties in a political system. The number of
potential conflict channels increases faster than the number of parties. It
should affect the duration of governmental cabinets: The more parties,
the more conflicts; the more conflicts, the shorter the cabinet durations.
But the broad issue is much wider.
People form a society only if they communicate. Interaction among
individuals is the very definition of what society is. This involves any
sort of communication channels among social actors. The number of
communication channels increases enormously as more individuals are
added, and much of social organization addresses the issue of how to
cut down on this number. The shift from direct democracy to representative democracy is one such attempt.
How many communication channels (c) are there among n actors?
Draw pictures and count the channels, for a small number of actors.
You’ll find the following.
Number of actors, n:          0   1   2   3   4   5   6
No. of conflict channels, c:  0   0   1   3   6   10  …
What is the general logical model for n actors? We can approach this
issue step by step, looking at the table above. Two actors certainly have
1 communication channel. A third one extends a channel to each of
them, thus adding 2 channels. That makes 1+2=3. A 4th actor extends a
channel to each of the existing 3. That makes 3+3=6. Thus, at each step
in the table above, n and c add, to produce the new c. At the next step,
4+6→10, and then 5+10→15. We could continue, but it would take a
long while to reach, say, 100 actors. A more powerful approach is
needed.
Each actor extends a channel toward each of the (n-1) other actors.
So the total, for n actors, is n(n-1). But we have double counted,
approaching each channel from both ends. So we have to divide by 2.
The result is
c = n(n-1)/2.
Don’t just believe it. Check that this model yields indeed the numbers in
the table above. The formula fits even for n=0 and n=1. These cases
may look trivial, but actually they are very important conceptual
extreme cases. If our model did not fit the extreme cases we would
have to ask why it doesn’t. It may be tempting to say: “Oh well,
extreme cases are special ones – you can’t expect the general rule to
apply to them.” This is not so. To the contrary, we often construct the
general model by thinking about simple extreme cases. Recall the
question about the number of seat-winning parties. The “gorilla
question” was: “What happens when the district has only one seat?”
This pulled in the extreme case of one-seat districts. In the present case
we have to make sure that the general formula applies to n=0 and n=1
too.
When n is large, the difference between n-1 and n hardly matters.
Then the model simplifies into
c ≈ n^2/2. [n large]
For n=100, the exact formula yields 9900/2 = 4,950. The approximate
one yields 5,000 – close enough. The number of communication
channels increases roughly as the square of the number of actors.
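A short sketch comparing the exact count c = n(n−1)/2 with the large-n approximation n^2/2, including the small cases from the table above:

```python
def channels(n):
    """Exact number of communication channels among n actors: n(n-1)/2."""
    return n * (n - 1) // 2

def channels_approx(n):
    """Large-n approximation: about n squared over 2."""
    return n ** 2 / 2

for n in (0, 1, 2, 3, 4, 5, 6, 100):
    print(n, channels(n), channels_approx(n))
# For n = 100: exact 4950, approximate 5000.0 -- close enough
```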
Society is by definition not just a bunch of individuals but of
individuals who communicate with each other. If so, then laws about the
number of communication channels among actors should be among the
most basic in social sciences, leading to many consequences. Among
them, we’ll discuss here the average duration of governmental cabinets.
The number of communication channels also determines the number of
seats in a representative assembly; this will be discussed in Chapter 19.
Exercise 6.1
Suppose a university has 60 departments, divided evenly among 10
schools headed by deans. Assume for simplicity that the university
president communicates only with the deans below her and the minister
of education above her. Assume that each dean communicates only with
the president, other deans and the department heads in her own school.
a) Determine the president’s number of communication channels
(P) and that for a single dean (D). Comment on the disparity.
(WARNING: Draw the actual scheme described and count the
channels – do not fixate blindly on some formula.)
b) The president finds that her communication load is too high and
proposes to divide the university into just 4 large schools. How
large would the communication loads P and D become?
Comment on the disparity.
c) The president of course matters more than a dean. On the other
hand, the deans are many. So let us take the arithmetic mean of
P and D as the measure of overall communication load. How
large would it be for a setup with 10 schools? And for 4
schools? Which of these numbers would be preferable, for
keeping communication load down?
d) Would some number of departments reduce this load even
further? If so, then to what level?
e) Would reduction in communication load seem to justify
reorganization, assuming that our way to measure it is valid?
Average duration of governmental cabinets
How can c ≈ n^2/2 be used to determine the average duration of governmental cabinets? Cabinet breakdowns are caused by conflicts. Imagine
we reduce the number of potential conflict channels (c) among parties
by one-half; then the breakdown frequency of cabinets should also
decrease by one-half. Hence cabinet duration (capital C) should double.
But the number of conflict channels itself grows as the square of the
number of parties (N). Thus, if we reduce the number of parties by one-half, cabinet duration should become 4 times longer. Such reasoning
leads to an “inverse square” relationship:
C = k/N^2.
Here k is a “constant”, the value of which we do not know. Note that C
is measured in units of time. Since N is a pure number (without units), k
must also be in units of time, so as to preserve “dimensional consistency” between the two sides of the equation.
Exercise 6.2
“If the number of parties is reduced by one-half, cabinet duration should
become 4 times longer.”
a) Verify that equation C = k/N^2 does lead to this outcome. You
do so by plugging in N/2 instead of N, and 4C instead of C.
Are the two sides of the equation still equal?
b) To carry out such verification, do we have to know the
value of constant k?
Does this equation enable us to predict duration of cabinets, for a given
number of parties? Not yet, because the value of constant k is not
known. How do we find it? Graph C vs. N for many countries and see if
any curve of form C = k/N^2 fits the data cloud. The actual study
(Taagepera and Sikk 2010) included 35 countries. The 5 countries in
Table 5.2 (and the earlier Table 2.1) were chosen here so as to represent
fairly well this broader set.
In Exercise 2.1 you were asked to graph C vs. N and draw in the
trend curve for these data points. Then you were asked to draw in three
curves of form C = k/N^2: C = 21 years/N^2, C = 42 years/N^2, and C = 84
years/N^2. You probably concluded that all the data points were in
between the extreme curves, and the central curve was fairly close to
your trend curve. This means that setting k at 42 years more or less
agrees with the data.
But it is pretty hard to draw a best-fitting curved line and to decide
how close it is to a model-based curve. Shifting to logarithms changes
the curves of form C = k/N^2 into straight lines, log C = log k − 2·log N. Since k
can have many values, this is an entire family of lines, but all of them
have slope -2 and hence are parallel. In Exercise 2.3 you probably
observed, too, that the data points fell roughly on a straight line. Now it
was much simpler to move a transparent ruler and choose which line
looked like the best-fit line. (The best fit line can be determined more
accurately by statistical methods, but the main point remains: one first
has to convert to logarithms.) You most likely found that this line had a
somewhat steeper slope than the model line, y=1.62-2x. This was so for
these 5 data points. With 35 countries, the agreement with the expected
slope -2 is close.
In the actual study, we determined the statistical best-fit line with
slope -2 (not just any slope) and found its intercept with the y-axis (at log N = 0). This
was found to be 1.62, which is the logarithm of 42. (How do we find k
when we know logk? On a usual pocket calculator, enter 1.62, and push
“2nd/LOG”.) So k=42 years, and the relationship is specified as
C = 42 years/N^2.
This equation is logically based regarding N^2, while the constant “42
years” is empirically determined. It predicts cabinet duration for a given
number of parties, within a considerable margin of error of ×÷2
(multiply or divide by two) – remember Botswana and Greece on your
graphs! Even so, this is the mean duration over a long period. One
particular cabinet can last much longer or fall within days, if there
should be a major scandal.
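The fixed-slope fit can be sketched with the five countries of Table 5.1 (an illustrative calculation, not the procedure of the 35-country study): force the slope to −2, so the only free constant is the intercept log k, which each country estimates as log C + 2·log N.

```python
import math

# Number of parties N and mean cabinet duration C (years), from Table 5.1
N = [1.35, 1.7, 2.2, 3.3, 5.0]
C = [40, 14.9, 4.9, 4.7, 1.3]

# With the slope fixed at -2, log C = log k - 2 log N,
# so each country gives an estimate of log k = log C + 2 log N.
log_k = sum(math.log10(c) + 2 * math.log10(n) for n, c in zip(N, C)) / len(N)
print(round(log_k, 2), round(10 ** log_k))   # about 1.62, i.e. k close to 42 years
```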
Let us review what has been done, in somewhat more abstract terms.
First assume that we have no logical model – just data. It is hard to
describe the general trend in any detail when looking at the graph C vs.
N, apart from the coarse directional description: “N up, C down”. On the
graph logC vs. logN, however, something new jumps to the eye: the
pattern visibly is very close to linear. This means that log C = a + b·(log N).
This corresponds to C = k·N^b. Thus b is the same in both equations, while
log k = a. We can calculate the constants a and b in log C = a + b·(log N), as
you did in Exercise 2.3.
Hence we get the b in C = k·N^b directly and k indirectly. We find,
however, that b is suspiciously close to -2. This may have a theoretical
reason. So we start to look for this reason. Then c ≈ n^2/2 supplies the
answer.
This is part of a broader chain of relationships, connections among
connections. Cabinet duration depends on the number of parties. The
number of parties, in turn, can be deduced from the assembly size and
the number of seats allocated in the average district. (Calculating the
number of seat-winning parties in Chapter 1 and the share of the largest
party in Chapter 4 were important steps in this chain.) Hence we can
deduce cabinet duration from the number of seats in the assembly and in
the district. This means that we can design for a desired cabinet duration
by manipulating the assembly size and the number of seats allocated in
the average district. Thus, this model building has reached a stage that
could be of practical use, even while the range of error is as yet quite
large.
Leap of faith: A major ingredient in model building
A crucial link in the model was actually pretty weak: “Imagine we
reduce the number of potential conflict channels (c) among parties by
one-half; then the breakdown frequency of cabinets should also
decrease by one-half.” Is that really so? One may accept that more
conflict channels make coalition breakdown more likely. But what is the
evidence that breakdown frequency is proportional to c? Should I give
up on building a model for cabinet duration until I find such evidence?
No.
The answer is two-fold. First, breakdown frequency being proportional to the number of conflict channels is the simplest approach to try,
unless we have contrary evidence. So let us try it, even while it is a leap
of faith! Second, if my assumption is wrong, it will catch up with me
later on: the resulting model will not fit the data. That the final model
actually correctly connects C to N is indirect evidence that the proportionality assumption is valid.
The world is full of people who point out why things cannot be done.
And then there are those who try anyway. They sometimes succeed.
Those who don’t dare to try never do. This applies to model building too.
* Exercise 6.3
Consider a pretty extreme situation where each day sees a new cabinet
formed. How many parties would this correspond to?
a) Convert the equation C = 42 years/N^2 into the form N = ..., with N
alone on the left side and all the rest on the right. Go slow and
apply the basic balancing rule of algebra, step by step (see box).
b) Use this equation to calculate N when each day sees a new
cabinet formed. oahketarb
c) Now suppose, instead, that we actually feel no democratic
regime could withstand more cabinet changes than once a
month. How many parties would this correspond to? zuket.arb
d) Now suppose we draw a line at a situation where the effective
number of parties is 30. To what cabinet duration might it lead?
.nolnegsba  egysba p
The basic rule of algebra: Balancing
The equation C = k/N^2 is nice for calculating C. But sometimes we want to
calculate N for a given C. Then we have to convert this equation into
something that looks like N=... , with N alone on the left side and all the rest
on the right. Sometimes we even wish to find out what k would have to be
for a given C and N; then we have to convert it into form k=... This is the
topic of Exercise 6.3.
The hard fact is that quite a few social science students find such
conversions hard – they just haphazardly try to multiply or divide, trying to
recall some rules they memorized in Algebra 1, without ever understanding
why these rules make sense. Now it’s time to understand, otherwise you’ll
be lost in the following chapters.
The basic rule is balancing the two sides of the equation. Suppose we
want to calculate k. So we want to have k on the left side, alone. Instead of
C = k/N^2, reverse it to k/N^2 = C. To get rid of 1/N^2, multiply by N^2 – but you
have to do it on both sides of the equation, so as to maintain balance:
N^2(k/N^2) = N^2(C). But N^2/N^2 = 1, so we can cancel out the N^2 upstairs and
downstairs: N^2(k/N^2) = N^2(C), and k = N^2·C results. (Do not just read this. Put
the text aside, and practice it!)
Now suppose that we want to calculate N. This means we want to have N
on the left side, alone. Start with C = k/N^2. Multiply by N^2 – again on both
sides: N^2(C) = N^2(k/N^2). Again, cancel out the N^2s upstairs and downstairs,
so that N^2·C = k results. Divide on both sides by C: (N^2·C)/C = k/C.
Canceling out C/C leaves us with N^2 = k/C. Take the square root on both
sides: (N^2)^(1/2) = (k/C)^(1/2). As 2·(1/2) = 1, we are left with N = (k/C)^(1/2).
Put the text aside, and practice it! If you merely copy this procedure in
answering the exercise above, you are badly short-changing yourself. You
would condemn yourself to be a rule-applying underling rather than a
thinking person. If we merely apply rules of moving symbols around, we
may inadvertently multiply instead of dividing. Say, we want to get rid of C,
but instead of dividing on both sides by C, we multiply. Such an error
cannot slip by when we follow the cancelling-out procedure: Instead of C/C,
which cancels out, we get CC, which cannot cancel out. Have the patience
to go slow and really apply this balancing rule, step-by-step, until it becomes so automatic that you can start cutting corners, cautiously.
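Once the rearrangement has been done by balancing, it can be wrapped in code and checked by plugging numbers back in. A minimal sketch (using k = 42 years as before; the round-trip check is the point, not the particular numbers):

```python
k = 42.0                                   # years, as estimated above

def C_from_N(N):
    return k / N ** 2                      # C = k / N^2

def N_from_C(C):
    return (k / C) ** 0.5                  # N = (k / C)^(1/2), obtained by balancing

# Round-trip check: start from N, compute C, then recover N
N = 3.0
print(round(N_from_C(C_from_N(N)), 6))     # 3.0
```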
Laws and models
What is a law, in the scientific sense? It’s a regularity that exists, and
we know why it holds. A law can come about in two ways. Some
regularity may first be observed empirically and later receive an explanation through a logical model. Conversely, a model may be deduced
logically, and later be confirmed by testing with data. Most often, it is
messier than that:
Some limited data → an urge to graph them → seeing a puzzling
relationship → an urge to look for a logical explanation → a tentative
logical model → more data → possibly adjusting the model → possibly
refining the way data are measured → more data collection and model
adjustment … → a relationship that qualifies as a law.
For the purpose of learning model building, the main point here is
that the logical model for cabinet duration is quite different from the
previous. This model is based on the idea of communication channels.
Before you start thinking that the ignorance-based approach is all
there is to model construction, the example of cabinet duration serves as
a warning that logical models come in an infinite variety of ways,
depending on the issue at hand. One has to think.
Now you are in a position to follow the history of a research issue.
This may be useful in perceiving how quantitative research proceeds
more generally. Someone else had calculated and tabulated the data and
observed that duration decreased when the number of parties increased
(Lijphart 1984: 83, 122, 124–126). Thus, a directional model was
established. I graphed Lijphart’s data, because I wanted to have a
picture of the relationship. The pattern was clearly curved (like your
graph in Exercise 2.1). So I graphed the logarithms (like your graph in
Exercise 2.3), hoping the data cloud would straighten out. It did, and the
slope in this graph was suspiciously close to -2.0, meaning an inverse
square relationship. Such simple integer values do not happen just like
that – “Maybe they want to tell us something,” I said to myself.
For a long time, however, I could not figure out what the data graph
was trying to tell me – it isn’t such a straightforward process. After a
while, the connection to communication channels occurred to me. But
the process started with graphing the data. This is why I have explained
the process here in some detail, so that you can not only admire it but
also use it.
7. How to Use Logarithmic Graph Paper
• Many logical models and empirical data patterns in social science are best presented on logarithmic scales.
• It is easy to establish that log2=0.30 and log5=0.70.
• Graphing numbers on log scale and graphing their logarithms on regular scale leads to exactly the same image.
• Some curved data clouds y vs. x straighten out when graphed on fully logarithmic (log-log) graph paper. Some others do so on semilog graph paper. For some, neither works.
When we want to graph logarithms, it’s a nuisance to have to calculate
the logarithms for each data point, the way I had to do for Table 5.1.
There is a shorter way: use logarithmic graph paper. The instructor may
distribute to you sample copies, which you can use as master copies for
producing what you need. Logarithmic graph paper can also be found
on the Internet, but be cautious – the format offered may or may not suit
your needs. To make an intelligent choice, we have to understand what
is involved. First, let us extend our grasp of logarithms.
Even I can find log2!
We said that logarithmic scale is a scale where you add equal distances
as you multiply by equal numbers. On this scale, the distances from 1 to
10, from 10 to 100, and from 100 to 1000 are equal. But we must be
consistent. If multiplying by 10 adds equal distances, then this must be
so for all numbers. For instance, distances must be equal when
multiplying by 2: 1, 2, 4, 8, 16, 32, 64, 128... And the two scales must
agree with each other. This means that 8 must be placed a little lower than
10, and 128 a bit higher than 100. This can be done, indeed,
as shown in Figure 7.1.
Figure 7.1. Integer exponents of 10 and 2, and their logarithms.
Now comes the creative step. Note that 2^10 = 1024 is awfully close to 10^3.
Hence we can write 2^10 ≈ 10^3. This means that 10 times log2 roughly
equals 3 times log10: 10·log2 ≈ 3·log10 = 3. It follows that log2 ≈ 3/10 =
0.30. The log2 cannot logically be anything else! Actually, log2 =
0.30103 because 1024 is slightly larger than 1000. The point is that it is
within your ability to determine logarithms of simple numbers. We can
do so only approximately, but often this suffices. Of course, we most
often use pocket calculators to calculate logarithms, but it makes the
“logs” less scary to know we could find them even without a calculator.
Since 4 = 2^2, it immediately follows that log4 = 2·log2 ≈ 0.60. And since
8 = 2^3, log8 ≈ ... (fill in the blank). What about log5? Note that 5×2 = 10. So
log5 + log2 = 1, and log5 ≈ ... (fill in the blank).
* Exercise 7.1
a) Determine log3, approximately. Hint: Note that 3^4 = 81 ≈ 80 = 2^3×10.
Take logarithms on both sides. [Assume log2=0.30 and log5=...
from above.]
b) Determine log7, approximately. Hint: Note that 7^2 = 49 ≈ 50 = 5×10.
Take logarithms on both sides.
c) To complete the list of logarithms of integers 1 to 10, calculate
log6 and log9. Compare our approximate values to those in Table
5.2.
And now I ask you something really scary. Estimate log538.6. How in
the world could you know that? Relax. This is approximately log500.
Which, in turn, is log(5×100). We found that log5≈0.70. So log538.6≈
0.70+2=2.70. How close are we? Enter 538.6 on your pocket calculator
and push LOG: 2.731. Close enough!
(If we are really ambitious, we might note that 538.6 ≈ 540 = 2×27×10 =
2×3^3×10. So log538.6 ≈ 0.30 + 3×0.475 + 1 = 2.725. But if we need that
much precision, better get a calculator.)
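The back-of-the-envelope values can be compared with a calculator in a couple of lines (a sketch; the “hand” values are simply the approximations quoted in the text above):

```python
import math

hand_values = {2: 0.30, 4: 0.60, 5: 0.70, 538.6: 2.70}   # rough values derived above
for x, approx in hand_values.items():
    print(x, approx, round(math.log10(x), 3))
# Calculator values: 0.301, 0.602, 0.699, 2.731 -- all close to the hand estimates
```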
Placing simple integer values on logarithmic scale
In the previous Figure, we could now place the logarithms of 2, 3, 4 etc.
on the log x scale (which is a regular scale), but instead, we could place
the numbers themselves on the x scale (which is logarithmic), as shown
in Figure 7.2. We have not bothered writing out the zeros in 10, 20,
200, etc. because it is sort of superfluous. Indeed, if we decide that one
of the “1” on this scale really stands for 1, then the next “1” and “2” on
its right must stand for 10 and 20, and so on. And the “1, 2, ...5” on its
left must stand for 0.1, 0.2 … 0.5. Each “period” (from one “1” to the
next “1”) looks exactly like another one – it’s just a matter of dividing
or multiplying the numbers shown by 10.
Why bother with all that? The happy outcome is that, instead of
laboriously calculating logarithms for each number we want to graph on
the regular scale at the bottom of Figure 7.1, we can just place those
numbers themselves on this logarithmic scale. This is precisely what the
logarithmic graph paper does for us – and it tremendously speeds up
graphing.
Figure 7.2. The numbers themselves placed on the “logarithmic scale”
correspond to their logarithms placed on the regular scale.
CAUTION! Some students sort of logically, but mistakenly, think that
numbers should be graphed on regular scale and their logarithms on the
logarithmic scale. No! The reverse is the case. Graphing numbers on log
scale and graphing their logarithms on regular scale leads exactly to the
same image. The log graph just spares us from calculating each
logarithm separately.
Fully logarithmic or “log-log” graph paper
Maybe the paper supplied to you looks like the one in Figure 7.3. It has
1 period in the horizontal direction and 2 periods in the vertical. Other
graph papers may have only one period in both directions, or as many as
5 in one direction. Note that all 1-period squares are identical. Thus, if
we should need more periods than the paper has, we can just paste
together as many as we need. Further advice is given, using Exercise 7.2
as a starting point.
Figure 7.3. Blank grid of fully logarithmic (“log-log”) graph paper,
with 1 period in the horizontal direction and 2 periods in the vertical. It
has “1 … 1 …1” at both axes.
* Exercise 7.2
Use the N and C data in Table 5.1. Use graph paper supplied or
available on Internet.
Do NOT use the example above or try to construct your own graph
paper, if you can avoid it. See advice that follows this Exercise.
a) Place the data points (N,C) on the log-log grid (NOT their
logarithms!).
b) Comment on the shape of the resulting pattern.
c) Draw the best-fit line through the data points.
d) Superimposed on the previous (not separately!), graph the
logical model equation C=42 years/N2, using at least 3 values of
N. Comment on the resulting shape.
e) Draw in the line going from point (1, 100) to (10, 1). It has the
slope -2, given that it drops by 2 periods on the y-scale while
advancing by one period on the x-scale. Compare the slopes of
your best-fit line and of C=42 years/N2 to this comparison slope.
f) Compare to your graph in Exercise 2.3. Do they differ, apart
from different magnification?
First look at how many periods from 1 to 10 you need. In Table 5.1, the
number of parties ranges from 1 to 5, so one period will suffice.
Duration ranges from 1 to 40 years, so we need two periods – 1 to 10
and 10 to 100. At the top period, expand the labels “1, 2, …1” to 10, 20,
… 100, as shown in Figure 7.4. Now you are ready to graph the data
points (N,C).
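Once you can control a canned graphing program, the same log-log picture can also be produced on a computer. Here is a hedged sketch of mine (assuming matplotlib) that draws the model curve C=42 years/N2 on log-log scales, where it becomes a straight line of slope -2; the (N,C) data points from Table 5.1 could be added with a further plt.plot call.

    import matplotlib.pyplot as plt

    N = [1, 2, 3, 5, 10]
    C = [42.0 / n**2 for n in N]     # model values: N=1 gives 42, N=10 gives 0.42

    plt.plot(N, C)                   # the model curve
    plt.xscale("log")
    plt.yscale("log")
    plt.xlabel("Number of parties N")
    plt.ylabel("Cabinet duration C (years)")
    plt.show()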
Figure 7.4. Grid for log-log graph, with “1 … 1 …1” filled in as “1 …
10 …100” for the purposes of Exercise 7.2.
Slopes of straight lines on log-log paper
Measuring slopes of lines on log-log graphs (such as required in
Exercise 7.2, part e) may be confusing at first. Forget about the scales
shown! Take a ruler and measure, in physical distance on the paper, by how much
the line rises or falls vertically while it advances horizontally by a
given amount. For the slope, divide the vertical change by the horizontal change.
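In terms of logarithms, this physical-distance recipe amounts to dividing the change in log y by the change in log x. A minimal sketch of mine in Python does the same arithmetic for any two points on the line:

    import math

    def loglog_slope(x1, y1, x2, y2):
        # Slope of the straight line through (x1, y1) and (x2, y2) on a log-log plot.
        return (math.log10(y2) - math.log10(y1)) / (math.log10(x2) - math.log10(x1))

    # The comparison line of Exercise 7.2(e), from (1, 100) to (10, 1):
    print(loglog_slope(1, 100, 10, 1))   # -2.0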
CAUTION: Be careful with log-log papers found on the Internet. Sometimes
the periods in horizontal and vertical directions differ in length, so as to
fill an entire page. This is illustrated in Figure 7.5. Here the x-scale has
been pulled out to twice the length, compared to the y-scale. If we draw
a line from bottom left to top right, it seems to have slope 1, but it really
has slope 2. Why? When log x grows from 0 to 1, log y grows from
0 to 2. To get the real slope for any other line on this grid, we would
also have to correct the apparent slope by the same factor 2.
The periods on the two scales do not have to be equal, in principle.
However, unequal scales distort the slopes of lines, which play an
important role in Exercise 7.2. This may confuse you, if your experience with log-log graphs is still limited.
Figure 7.5. Risky grid for log-log graph: x-scale pulled out to twice the
length.
Exercise 7.3
To graph population vs. area of countries in the world, one must reach
from 21 square kilometers (Nauru) to 17 million square kilometers (Russia), and from
13,000 for Nauru to 1.3 billion for China on the population scale.
a) How many 1-to-10 periods do we need on the area axis?
sba
b) How many 1-to-10 periods do we need on the population axis?
hat
NOTE: We do NOT have to start with 1 square kilometer or 1 person.
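Counting the needed periods simply means counting how many powers of 10 the values span. Here is a small sketch of mine for checking such counts on any range of values (the numbers used are arbitrary, so as not to give away the exercise):

    import math

    def periods_needed(smallest, largest):
        # How many 1-to-10 periods cover the range from smallest to largest.
        return math.ceil(math.log10(largest)) - math.floor(math.log10(smallest))

    # Values from 3 to 4000 need the periods 1-10, 10-100, 100-1000 and 1000-10000:
    print(periods_needed(3, 4000))   # 4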
Semilog graphs
On log-log graphs both axes are on logarithmic scales. But one can also
use log scale only on one axis, the other being on a regular scale. This is
often called a “semilog” graph. Its grid is shown (imperfectly) in Figure
7.6. The numbers on the regular x-scale are of course arbitrary.
Figure 7.6. Grid for semilog graph y vs. x, 2 cycles on the y-axis.
* Exercise 7.4
Estonia’s national currency prior to adoption of the euro (2011) had 12
different coins and bank notes. They ranked as follows, by increasing
values:
Rank:   1     2     3     4     5   6   7   8   9   10   11    12
Value:  0.05  0.10  0.20  0.50  1   2   5   10  25  50   100   500
a) Graph value on log scale and rank on regular scale. (We need
graph paper that covers five 1-to-10 periods. If your paper has
less, then cut and paste extra ones.)
b) Draw in the trend curve or the line that respects nearly all points.
(Do NOT join the individual points with a wiggly curve!)
c) Notice and describe one blatantly irregular point.
d) Draw conclusions regarding regularities and irregularities in
this pattern.
e) Make a list of coins and bank notes used in your country and
repeat the exercise.
f) Compare the slopes of the two curves.
Why spend so much time on the cabinet duration data?
The model connecting cabinet duration to the number of parties well deserves one chapter, as an example of what the basic communication
channels model c=n(n-1)/2 can lead to. But why has it been invoked
repeatedly from Chapter 2 on? The reason is that it supplies also a
convenient example to demonstrate three different ways to graph the
same data. C vs. N on regular scales produced a curve, and curves are
harder to handle, compared to straight lines. Then we showed that using
logC vs. logN turns this curve into a straight line. This could be done in
two ways, which lead to the same result: calculate logC and logN for
each case separately and graph again on regular scales, or avoid such
calculations and graph C and N themselves, but on logarithmic scales.
This is not a universal recipe. Try graphing the currency data in
Exercise 7.4 on log-log paper, and see what you get. Here the semilog
paper straightens out the pattern – provided we place the ranks on the
regular scale, not vice versa. Try graphing the cabinet duration data on
semilog paper, and see what you get. Here the log-log paper straightens
out the pattern. Sometimes neither works.
Regular, semilog and log-log graphs – when to use which?
When data include both very small and very large values, a logarithmic
scale is the only way to tell the medium-sized countries apart from the tiny
ones. But the number of parties does not vary that much. We still
graphed it on log scale. We did so because the logical model suggested
that the expected curve would then turn into a straight line.
How do we know then which way to graph, in general? Some guidelines will be given later on, but at times I have no idea. Then I graph the
data in several ways, and sometimes a linear pattern appears for some
way of graphing. Then it’s time to ask: What is the reason behind this
regularity? This may be the starting point for trying to construct a
logical model.
B. Some Basic Formats
8. Think Inside the Box – The Right Box
•  In any data analysis we should look for ability to predict and for connections to a broader comparative context.
•  Our equations must not predict absurdities, even under extreme circumstances, if we want to be taken seriously as scientists.
•  Poorly done linear regression analysis often does lead to absurd predictions.
•  Always graph the data – and more than the data! Show the entire conceptually allowed area plus logical anchor points.
•  Before applying a statistical best fit to two variables, graph them.
•  To be acceptable, a data fit as well as a logical model must result in a curve that joins the logical anchor points, if any exist.
•  If anchor points and data points do not approximate a straight line, transform data prior to linear fit so as to get a straight line that does not pierce conceptual ceilings or floors.
•  After any quantitative processing, look at the numerical values of parameters and ask what they tell us in a comparative context.
Up to now, we have mostly started by constructing a logical model and
then looking for data to test it. Now we start with some data. How do
we proceed so as to infer a logical framework? Often (though by no
means always), we are advised to think inside the box – but it has to be
the right one. Consider the following hypothetical data set of just 6
points (Table 8.1).
Table 8.1. Hypothetical relationship between satisfaction with the head
of state (x) and satisfaction with the national assembly (y).
x:   2.0   2.5   3.0   3.5   4.0   4.5    Mean: 3.25
y:   0.2   0.2   0.9   1.1   2.4   3.6    Mean: 1.40
Always graph the data – and more than the data!
If you have been exposed to a statistics course, you might have the
knee-jerk reaction to push the OLS (Ordinary Least Squares) regression
button on your computer program. It would lead to a line roughly
corresponding to y=-4.06+1.68x (R2=0.9). This result would look highly
satisfactory, if one merely goes by high R-squared. (We’ll discuss later
what this R2 means – and does not mean.) But if the pattern is actually
curved – like in our Exercise 2.1 – this would be a misleading fit. Some
data patterns are horseshoe shaped, and here a linear fit would be utterly
senseless. How do we know that a linear fit makes sense? Always graph
the data!
For the data above, however, graphing leads to a pretty straight
pattern (Figure 8.1). Only a negligible curvature might be seen, so
linear regression looks justified. (I eyeballed the equation above from
this graph. Precise OLS procedure would add to formalism but little to
information content.) If you let the computer draw the graph, it may
even automatically add the best fit line, show its equation, and draw a
nice frame to box in the data points, as shown in Figure 8.1. This is the
wrong box, however. We should think outside this box.
Figure 8.1. Graphing only the data, regressing linearly, and boxing in
the area where data points occur. Dotted line: y=x.
Indeed, what does this graph plus regression equation tell us in a
substantive way? We can tell that support for assembly (y) increases as
support for president (x) increases, given that the slope b=1.68 is
positive. Moreover, y increases faster than x, given that the slope b=1.68
is larger than 1. This is where analysis of such data often stops. What
else should we look for?
In any data analysis we should look for ability to
predict and for connections to a broader comparative
context.
What does this “broader context” mean? Play around with the
regression line, y=-4.06+1.68x. Ask, for instance, what it would predict
for x=2? It would predict y=-0.70. A negative approval rating? Approval
ratings usually start from zero. If this is the case here, then such
prediction makes no sense. It goes against the following norm:
Our equations must not predict absurdities,
if we want to be taken seriously as scientists.
Our equations should not do so even outside the empirical range of
input variables.
We should also ask why the intercept is around a=-4.06? What does
this number tell us? Why is it negative? Is it large or small compared to
the intercepts in some other data sets? Unless one asks such contextual
questions, it is pretty pointless to calculate and report a precise
regression equation that, moreover, predicts absurdities for some values
of x. It is high time to graph more than just the data.
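One quick way to play around with such a line is to scan it over the whole conceptually possible range of x and flag absurd predictions. A tiny sketch of mine, assuming (as discussed below) that both ratings can only run from 0 to 5:

    def predicted(x):
        return -4.06 + 1.68 * x      # the eyeballed regression line

    for x in range(0, 6):
        y = predicted(x)
        flag = "" if 0 <= y <= 5 else "  <- absurd prediction"
        print(x, round(y, 2), flag)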
Graph more than the data! This advice may sound nonsensical to
those who reduce science to data processing. “What else but data is
there to graph?” they may ask. But we have already graphed more. In
Chapter 5, we graphed forbidden and allowed regions. We will build on
that idea. But even before doing so, we might draw in the equality line,
if it makes sense.
Graph the equality line, if possible
We might start by graphing the equality line (y=x), if equality can be
defined. It certainly can, if x and y are percentages. Here it can, too,
provided that both ratings are on the same scale. In contrast, if x is a
country’s area and y its population, then no equality can be defined.
By thinking of equality, we add a conceptual comparison line: equal
support for both institutions. In Figure 8.1, it’s the dotted line, top left.
Now we can see that support for assembly always falls short of support
for president, even while it seems to be catching up at high values of x.
One may say that this was obvious even without graphing the
equality line – but I have seen too many published graphs where this
line was not drawn in and the “obvious” implication was missed. Would
y ever catch up with x? To answer this question, it is time to graph even
more, besides the data.
Graph the conceptually allowed area – this is the right box to think in
I should have told you right at the start about the scale on which people
were asked to rank the president and the assembly. It matters whether
the given data refer to variables that can range from 0 to 5 or from 0 to
10 – or worse, from 1 to 5. The range can even differ on the two axes.
Published articles all too often hide or omit this all-important information. Suppose the conceptually allowed range is 0 to 5 on both axes.
This is the box (Figure 8.2) that has a logical meaning. Now the same
data take on a different appearance: The curvature somehow looks more
pronounced. We are also motivated to ask: What are the extreme
possibilities?
Figure 8.2. The limits of the conceptually allowed area and the logical
anchor points. The dashed line is the regression line.
Our data suggest that the assembly’s ratings do not surpass the president’s. (This is most often the case, all over the world.) These ratings
also cannot fall below 0. In the absence of any other information, what
could the assembly’s rating be when even the president’s rating drops as
low as 0? The only answer that agrees with the considerations above is that
the assembly also must have zero rating. Similarly, when even the
assembly is rated a full 5, then the information above does not allow the
president to have any less than 5. These points (0,0 and 5,5) are logical
anchor points for such data. They are indicated in Figure 8.2 by
triangular symbols reminiscent of letter A for “Anchor”.
Now we come to a corollary of the previous anti-absurdity norm:
A data fit is meaningful only when it remains within
the conceptually allowed area and includes the logical
anchor points.
The linear data fit (dashed line in Figure 8.2) violates logic. For presidential ratings below 2.4, it would predict negative ratings for the
assembly. For assembly ratings above 4.4, it would suggest presidential
ratings above 5. We must use a different format.
To be acceptable, a logical model or data fit must
result in a smooth curve that joins the anchor points
and passes close to most data points.
We should keep the equation of this curve as simple as possible. Figure
8.3 shows such a curve. How was it obtained?
The simplest curve joining the anchor points
The simplest algebraic format that takes us smoothly from 0,0 to 1,1 is
the “fixed exponent” equation
Y = Xk.
You can check that, whenever X=0, we are bound to have Y=0,
regardless of the value of exponent k. Similarly, when X=1 then Y=1 –
and vice versa. This is the most manageable format for two quantities
with clear lower and upper limits at 0 and 1. Physicists often call such
equation a power equation, saying that here “x is to the power k”. But
this term confuses and angers some political scientists who deal with
political power. So I use “fixed exponent equation”.
Figure 8.3. Normalizing to range 0 to 1, and fitting with Y=X3.6.
In the present case the variables’ range is not from 0 to 1 but from 0 to 5.
This would make even the simplest possible form more complex:
y=5(x/5)k. It is less confusing to shift to different variables, which do
range from 0 to 1: Y=y/5 and X=x/5. Figure 8.3 shows these
“normalized” scales, and Table 8.2 tabulates the normalized data. It can
be seen that Y=Xk yields a fair fit to data (plus exact fit to anchor points)
when k is set at 3.6. Indeed, deviations of Y=X3.6 from actual data
fluctuate up or down fairly randomly. But how was the value k=3.6
found? Logarithms again enter.
Table 8.2. Original and normalized data, and fit with Y=Xk.
x      y      X=x/5   Y=y/5   Y=X3.6   Deviation
0      0      0       0       0        0
2.0    0.2    0.40    0.04    0.037    -
2.5    0.2    0.50    0.04    0.08     +
3.0    0.9    0.60    0.18    0.16     -
3.5    1.1    0.70    0.22    0.28     +
4.0    2.4    0.80    0.48    0.45     -
4.5    3.6    0.90    0.72    0.68     -
5      5      1       1       1        0
Recall that log(Xk)=klogX – when a number is raised to the power k, its logarithm
is multiplied by k. So the equation Y=Xk implies logY=klogX and
hence
k=logY/logX when Y=Xk.
Pick a location on the graph that approximates the central trend, such as
(0.60, 0.16). (It need not be an actual data point.) Then
k=log0.16/log0.60=(-0.796)/(-0.222)=3.587≈3.6. On my pocket calculator the
sequence is: 0.16 “LOG” “÷” 0.60 “LOG” “=”. On some others, it is
“LOG” 0.16 “÷” “LOG” 0.60 “=”. If you still feel uneasy with logarithms, see Chapter 13.
CAUTION. Some students think they can replace log0.16/log0.60
with log16/log60. This is not so. Log0.16/log0.60=3.587, while
log16/log60=0.677 – quite a difference!
Once we have k=3.6, we can calculate Y=X3.6 for any value of X.
E.g., for X=0.40, on my pocket calculator 0.40 “yx” 3.6 “=” yields
0.0369≈0.037, as entered into Table 8.2.
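For those using a computer instead of a pocket calculator, the same two steps look like this (a minimal sketch of mine, repeating only the numbers worked out above):

    import math

    X0, Y0 = 0.60, 0.16                     # a location near the central trend
    k = math.log10(Y0) / math.log10(X0)     # k = logY/logX, about 3.59
    print(round(k, 2))

    for X in (0.40, 0.50, 0.60, 0.70, 0.80, 0.90):
        print(X, round(X ** 3.6, 3))        # e.g. 0.40 gives 0.037, as in Table 8.2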
If we want a more precise fit to the data points, we can graph logY
against logX and determine the best-fit line that passes through the point
0,0. Given that logY=klogX, the slope of this line is k. For even more
precision, we can run linear regression of logY on logX on computer.
But how much precision do we really need? Even a coarse fit that
respects the conceptual anchor points is vastly preferable to a 3-decimal
linear fit that predicts absurdities. We need enough precision to compare
these data to some other data of a similar type. What does this mean?
We could compare the assembly and presidential ratings at a different time, or in a different country. Would the points still fall on the
curve for k=3.6, or would a different value of k give a better fit? Only
such comparisons lend substantive meaning to the numerical value
k=3.6, by placing it in a wider context. The implications will become
apparent in the next few chapters.
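For readers who do want the more precise fit mentioned above, here is one hedged way to set it up on a computer (a sketch of mine, assuming the numpy library): regress logY on logX while forcing the line through the origin of the log-log plot, i.e. through the anchor point X=1, Y=1.

    import numpy as np

    X = np.array([0.40, 0.50, 0.60, 0.70, 0.80, 0.90])
    Y = np.array([0.04, 0.04, 0.18, 0.22, 0.48, 0.72])

    logX, logY = np.log10(X), np.log10(Y)
    k = np.sum(logX * logY) / np.sum(logX ** 2)   # least-squares slope, no intercept
    print(round(k, 2))   # somewhat above the eyeballed 3.6; the lowest points pull it up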
* Exercise 8.1
Suppose people have been asked to express their degree of trust in their
country’s legal system (L, rated on a scale from 0 to 1) and in their
country’s representative assembly (A, also rated on a scale from 0 to 1).
We wonder how L and A might be interrelated.
a) Graph A vs. L. Label the locations 0, 0.5 and 1 on both scales.
Mark off the areas where data points could not possibly be – the
forbidden regions.
b) When L=0, what can one expect of A? When A=0, what can
one expect of L? Keep it as simple as possible. Mark the
resulting point(s) on the graph.
c) When L=1, what can one expect of A? When A=1, what can
one expect of L? Keep it as simple as possible. Mark the
resulting point(s) on the graph.
d) What is the simplest way to join these points? Draw it on the
graph, and give its equation. Let’s call it the simplest model
allowed by the anchor points.
e) Suppose we are given two data points: (L=0.45, A=0.55) and
(0.55,0.50). Show them on the graph. Is it likely that our
simplest model holds? Why, or why not?
f) Forget about those points. Instead, we are given three data
points: (L=0.45; A=0.35), (0.60,0.45) and (0.70,0.60). Show
them on the same graph, using symbols different from the
previous. (E.g., if you previously used small circles, then now
use crosses.) Is it likely that our simplest model holds? Why, or
why not?
g) Pass a smooth curve through the points in part (f) – a smooth
curve that predicts A at any allowed value of L, and vice versa.
What is the general form of the corresponding equation?
h) What is the specific equation that best fits the data points in part
(g), i.e., where constants have been given numerical values?
i) Assuming that this shape holds, how would you express in
words trust in legal system as compared to trust in assembly?
Support for Democrats in US states: Problem
Having worked through the previous example, you should now be in a
position to tackle the following. Figure 8.4 shows the Democratic percentages of votes in various US states in presidential elections 2000
compared to what they were in 1996, as reproduced from Johnston,
Hagen and Jamieson (2004: 50). The illegibly dark label at the line
shown reads “Vote2000 = -11.4 + 1.1*Vote1996”. It represents the OLS
(Ordinary Least Squares) regression line, which is one way to fit a line
to these data points. Several features can be added that might add to our
understanding of the changes from 1996 to 2000.
Before you read any further, do the following, and write it down.
a) List the features of interest that are missing.
b) Add these missing parts to the graph, as precisely as possible.
c) Try to find what the graph then tells you.
Try to do it on your own, before looking at the solution. Think in terms
of allowed areas, anchor points, continuity, equality lines, baselines, and
simplest curves joining anchor points. This is not to say that all of them
enter here in a useful way.
Figure 8.4. The starting point: Data and regression line, as shown in
Johnston, Hagen and Jamieson (2004: 50).
Support for Democrats in US states: Solution
Let x stand for “1996 Democratic percentage” and y for “2000 Democratic percentage”. For both, the conceptually allowed range is from 0 to
100. Draw it in, as precisely as you can. How can we do it?
Take a piece of paper with a straight edge. Mark on it the equal
distances 30, 40, 50 … on the x-axis. Move it first so as to reach 0 on
the left and then 100 on the right. Mark those spots. The graph given
makes us do so at the level y=20. Repeat it at the top level of the graph.
Use these two points to draw the vertical line x=0. Do the same at
x=100. Then do the same for y-axis. (Watch out: the scale is not quite
the same for y as it is for x.)
You may find that you do not have enough space around the original
graph to fit in the entire allowed region. If so, copy the graph, reducing
it. Or tape extra paper on all sides. Just don’t say: “It can’t be done,” for
technical reasons. You are in charge, not technology. Don’t be afraid to
use less than modern means, if they serve better the scientific purpose.
Now draw in the equality line, joining (0,0) and (100,100). Check
that it does pass through the (30,30) and (50,50) points visible in the
original graph. This is a natural comparison line, the line of no change,
from one election to another. It becomes visible that support for Democrats
1) decreased, from 1996 to 2000, and
2) not a single state went against this trend.
What about logical anchor points? If a state existed where support for
democrats already was 0 % in 1996, it would not be expected to buck
the trend and go up in 2000. Hence (0,0) is an anchor point. Also, if a
state existed where support for democrats still was 100 % in 2000, it
would not be expected to buck the trend and be less than that in 1996.
Hence (100,100) is also a logical anchor point.
The simplest curve that joins these anchor points is again Y=Xk, with
X=x/100 and Y=y/100. At X=0.50 we have approximately Y=0.44, in
agreement with the regression line shown in Figure 8.4. Hence
k=log0.44/log0.50=1.18, so that the resulting curve is Y=X1.18. Some
points on this curve are shown below:
X:   0    .1    .3    .5    .7    .9    1
Y:   0    .07   .24   .44   .66   .88   1
They enable us to draw in the approximate curve.
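The points above can be reproduced in a few lines (a minimal sketch of mine, using only the numbers already given):

    import math

    k = math.log10(0.44) / math.log10(0.50)   # about 1.18
    for X in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(X, round(X ** k, 2))            # 0.07, 0.24, 0.44, 0.66, 0.88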
Figure 8.5 shows these additions. The graph might look a bit nicer
when done on computer, but it might be less precise, even if we could
scan in the original graph. Mules and computer programs can be
stubborn. It’s better to have it your way and correct than their way – and
incorrect. The curve Y=X1.18 that respects the anchor points almost
coincides with the linear regression line for y above 50 %. It differs
appreciably for low values of y.
What have we learned that wasn’t evident from the original graph? It
may seem from Figure 8.4 that states varied fairly widely in their
support for democrats. In contrast, Figure 8.5 pins down that this
variation was rather modest when compared to the conceptually
possible range. We can see that no state bucked the trend away from
Democrats. Compared to the regression line, the average trend is
expressed in a way that does not predict absurdities at extreme values of
x or y. Johnston, Hagen and Jamieson (2004) explicitly excluded
Washington, DC from their graph, because its lopsidedly Democrat vote
in both years was outside the usual range. Yet, it would agree with
Y=X1.18.
A major payoff is that we express information in a more compact
form. Indeed, the regression equation involves two numerical values
(intercept -11.4 and slope 1.1), while Y=X1.18 makes do with a single
one (exponent 1.18). This would be extremely important, if we continued to study change in support during other periods. We can easily
compare changes in the single exponent k over many elections. In
contrast, the equations of regression lines would be hard to compare
systematically because we’d have to keep track both of intercept and
slope – and a single outlier could alter the intercept, even while it is just
noise from the viewpoint of systematic comparison.
Figure 8.5. Allowed region, anchor points, and model based on continuity between conceptual anchor points, fitted to data.
Exercise 8.2
In Tajikistan 1926 male literacy was 6% and female literacy 1%
(Mickiewicz 1973: 139). By 1939, male literacy was reported as 87%
and female literacy 77%. In 1959, male literacy was 98%. Calculate the
best estimate for female literacy in 1959. Think inside the box.
zuz.nul, tsõudhrõ
9. Capitalism and Democracy in a Box
•  The world isn’t flat, and all relationships are not linear.
•  If the equality line makes sense on a graph y vs. x, enter it as a reference line.
•  If both x and y can extend from 0 to a conceptual limit (such as 100), normalize it to go from 0 to 1.
•  If X=0 and Y=0 go together conceptually, and so do X=1 and Y=1, then Y=Xk is the simplest format to be tested.
•  If Y=Xk does not fit, try (1-Y)=(1-X)k next.
•  Boxes with three conceptual anchor points sometimes occur, leading to format Y/(1-Y) = [X/(1-X)]k.
This chapter uses previous methodology, so as to develop practical
ability to handle it. We start with a published graph on democracy and
capitalism, as an example, and develop it further. A slightly more
complex model is needed than the earlier one. Finally, we address a case
with 3 anchor points.
Support for democracy and capitalism: How can we get more out of
this graph?
Figure 9.1 is reproduced from Citizens, Democracy, and Markets
Around the Pacific Rim (Dalton and Shin 2006: 250). It graphs the level
of popular support for capitalism (y, in percent) against the level of
support for democracy (x). What can we see in this graph? We see that y
tends to increase with increasing x, but with some scatter. One could
comment on the individual countries, but let us focus on the general
pattern. What else can we learn from it?
A knee-jerk reaction might again be to pass the best-fit line through
the cloud. This line may roughly pass through Korea and slightly below
Vietnam. One could use this line for prediction of democracy/capitalism
relations in countries not shown in this graph. But we would run into the
same problem as in the previous chapter. If support for democracy were
less than 13 %, the linear fit would predict support for capitalism lower
than 0 percent! Oh well, one might say, be realistic – support for
democracy never drops that low. Science, however, often proceeds
beyond what is being considered realistic. It also asks what would
happen under extreme circumstances. Recall the basic norm: a conceptual model must not predict absurdities even under extreme circum-
stances. For every conceivable level of support for democracy a proper
model must predict a non-absurd level of support for capitalism – and
vice versa.
Could we say that the best-fit line applies only down to about 13 %,
and below that level support for capitalism is expected to be zero? This
would introduce a kink in the predicted average pattern, and most
physical and social relationships are smooth.
In sum, this is about all we can do with the best-fit line approach,
apart from adding a measure of scatter, like R-squared, which may be
around 0.5. For the given level of support of democracy, China looks
low on support for capitalism, while Philippines look high. What else
can we do? Try to do it on your own, following the approach introduced
in previous chapter. Only then continue reading.
Figure 9.1. The starting point: Data alone, as shown in Dalton and Shin
(2006: 250).
Note: SOME SCANNED-IN FIGURES MAY NOT COPY IN
NORTH AMERICA.
Expanding on the democracy-capitalism box
This graph does present the data within a box – but it’s the wrong box.
Its borders have no conceptual meaning. They are like a frame around a
picture, not part of the picture itself. This box could be appreciably
wider or slightly smaller, as long as it does not infringe on the picture
itself. We need a box that is part of the picture. This means we
demarcate the allowed region where data points can conceivably occur.
Take a straightedge and draw in the horizontal lines at support for
capitalism 0 % and 100 %, and the vertical lines at 0 and 100 % support
for democracy. The new lines are parallel to the ones already shown.
They produce a square inside the earlier square – and this square has
conceptual meaning.
In contrast, the square shown in Figure 9.1 is mere decoration, a
more austere version of mermaids embellishing the margins of ancient
maps. This square is the more dangerous because it is so close to the
box separating the forbidden and allowed regions, so that we may
mistake it for the allowed region. Many computer programs draw this
wrong box in automatically – one more reason to do our graphs by
hand, until we become sufficiently proficient to be able to control the
canned programs.
Figure 9.2. Data plus allowed region and equality line.
Next, draw in the equality line, y=x. This doesn’t always make sense, but
here it does, because both axes have the same units (or quasi-units) –
percent. These two additions are shown in Figure 9.2. At once, we see
more clearly that all the data points are below the equality line. It would
seem that people in all countries tend to voice more support for
democracy than for capitalism. The US, Singapore and Japan now look
closer to the right border than they did on the original graph. We realize
they are pretty close to the maximum possible. These are useful
insights.
Next, consider conceptual anchor points. If support for capitalism
is 0 % in a country, what level of support for democracy might we
expect? It would be hard to offer anything but 0 %. We cannot have less
than 0 %, and for proposing something more than 0 % we would have to
bend away heavily from the observed pattern. At the other extreme,
what level of support for democracy might we expect when the support
for capitalism is 100 %? Again it would be hard to settle on anything but
100%.
Our simplest assumption is that the curve y vs. x starts at the point
(0,0) and ends at (100,100). This means that for any value of x that
can conceptually occur we have a value of y that can conceptually
occur – and vice versa. Our predictions may turn out to be wrong, but at
least we do have a prediction for any value of x or y, a prediction we can
subject to verification. This is better than saying “We cannot know”.
Could we offer other patterns? Maybe at 90 % support for democracy, support for capitalism would suddenly shoot up to 100 %. But
why would it be at 90 %, rather than at 95 %? Or could it be that people
never become 100 % supportive of capitalism, even at 100 % support
for democracy? But would they stop at 80 % or at 90 %? A basic rule in
science is: Keep things as simple as you can. Do not assume complexities unless evidence makes you do so. There are enough real
complexities to deal with, without the need to introduce imaginary ones.
Fitting with fixed exponent function Y=Xk
Now determine the simplest curve that respects the anchor points
and the data. The simplest curve to join (0,0) and (100,100) would be
the equality line y=x. But it clearly does not fit the data. What is the
next simplest option? It’s a curve that initially keeps below the line y=x
and bends up to join it at (100,100).
What is a simple curve? It’s a curve with the simplest possible
equation, involving the least number of constants. For curves starting at
(0,0) and ending at (1,1), it’s the fixed exponent function or power
function, Y=Xk. Here the exponent k expresses the deviation from
equality line Y=X. For k=1, Y=Xk becomes the equality line Y=X. The
more the value of k increases above 1, the more the curve bends
downwards. For k<1, it bends upwards.
Exercise 9.1
So as to get some feel for what various values of k in Y=Xk imply, graph the
curves Y=X1, Y=X2, and Y=X0.5, superimposed, ranging from X=0 to X=1.
To sketch these curves, it will suffice to calculate Y at X=0, 0.25, 0.50,
0.75 and 1. Do not forget to think within the box – show this box!
In the graph on hand, however, the scales go from 0 to 100 rather than 0
to 1. We have to change scales so that the previous 100 becomes 1.
How do we do it? We have to switch from x and y to X=x/100 and
Y=y/100. In other words, we shift from percent shares (out of 100) to
fractional shares (out of 1). Why do we do it? If we kept the percent,
Y=Xk would correspond to y=100(x/100)k. Such a long expression would
be more confusing than it is worth.
Why Y=Xk is simpler than y=a+bx ?
The fixed exponent format may look complex. Would it not be simpler to
avoid curves and use two straight-line segments? The first one could run
from (0,0) roughly to the data point for Canada, at about x=79%, y=58%,
and the second one from Canada to (100,100). Let us consider how to
describe it. We’d have to write it so that the ranges are shown:
y=a+bx, 0<x<79,
y=a’+b’x, 79<x<100.
Count the letter spaces. The two straight-line expressions have 26 letter
spaces. That’s messy, compared to 4 letter spaces in Y=Xk. We had better get
used to fixed exponent functions, Xk.
Two straight-line segments would be considered simpler than a single
curve only if one takes as an article of faith that all relationships in the world
are or should be linear. But they aren’t, and for good conceptual reasons, to
which we’ll come. Belief in straight-line relationships is even more
simplistic than the belief of Ptolemaic astronomers that all heavenly bodies
must follow circular paths. The world isn’t flat, and all relationships are
not linear.
Which value of k would fit best the data in Figure 9.2? Recall the
previous chapter. Let us try fitting to Korea (X=.78, Y=.58), a point
fairly central to the data cloud. This leads to k=log.58/log.78=2.19.
Some points on the curve Y=X2.19 have been calculated, as explained in
the previous chapter:
X:   0    .2    .4    .6    .8    1
Y:   0    .03   .13   .33   .61   1
Figure 9.3 complements the previous one by adding the anchor points
(shown as triangle symbols) and the curve Y=X2.19 – and… this curve
visibly isn’t a good fit at low values of X! Indeed, only China is close to
the curve, while Vietnam, Philippines and Indonesia are all much
higher. What could we do?
Figure 9.3. Model based on continuity between conceptual anchor
points and fitted to data. (NB! The label “1.25” should read “0.57” – I
have not yet changed this scanned-in graph. Also the scales must be
read to go from 0 to 1.)
Fitting with fixed exponent function (1-Y)=(1-X)k
Note that we based our variables on support for democracy and
capitalism. We could as well reverse the axes and consider the lack of
support. This “opposition to democracy/capitalism” would correspond
to variables 1-X and 1-Y, respectively. Instead of Y=Xk, the simplest
format now would be
(1-Y)=(1-X)k
and hence Y=1-(1-X)k.
Fitting again for Korea leads to k=log(1-.58)/log(1-.78)=0.57. Some
points on the curve Y=1-(1-X)0.57 have been calculated:
X:   0    .2    .4    .6    .8    1
Y:   0    .12   .25   .41   .60   1
This curve, too, is shown in Figure 9.3. It visibly offers a better balance
between China on the one hand and Philippines and Indonesia on the
other. So this is the curve we might take as a guide for predicting (i.e.,
offering our best guesses) for further countries.
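The two competing fits anchored on Korea can be tabulated side by side in a short script (a sketch of mine, reproducing the points listed above):

    import math

    X0, Y0 = 0.78, 0.58
    k_support = math.log10(Y0) / math.log10(X0)               # about 2.19
    k_opposition = math.log10(1 - Y0) / math.log10(1 - X0)    # about 0.57

    for X in (0.2, 0.4, 0.6, 0.8):
        y_support = X ** k_support                    # Y = X^k
        y_opposition = 1 - (1 - X) ** k_opposition    # Y = 1 - (1-X)^k
        print(X, round(y_support, 2), round(y_opposition, 2))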
We are now close to the best quantitative description one could
offer when respecting the anchor points. We could improve on it
slightly by taking into account not just Korea but all data points and
running a linear regression. But we first have to transform the data so
that a linear form is expected (recall previous chapter). We have to take
the logarithms, on both sides of (1-Y)=(1-X)k:
log(1-Y)=klog(1-X).
This equation means that log(1-Y) is a linear function of log(1-X).
Hence, the regression must not be “Y on X” but “log(1-Y) on log(1-X)”.
Moreover, this line must go through the point X=0, Y=0. I will not go
into more details.
Exercise 9.2
Literacy percentages for males and females in three developing
countries are (M=45, F=35), (60, 45) and (70, 60). Build a simple model
that avoids absurdities. Show it graphically and as an explicit equation.
Draw conclusions.
* Exercise 9.3
The data in Figure 9.1 are roughly as follows.
Country        Support for   Support for   Opposition to    Opposition to
               Democracy     Capitalism    Democracy (X’)   Capitalism (Y’)
Canada             81            57
China              60            29
Indonesia          54            48
Japan              82            52
Korea, S.          79            57
Philippines        42            36
Singapore          83            58
US                 80            63
Vietnam            60            44
a) Calculate Opposition to Democracy / Capitalism on a scale
from 0 to 1 (NOT %!), and enter these numbers in the table.
b) Call Opposition to Capitalism Y’ and Opposition to Democracy
X’. (This reverses the previous labels: what was called 1-Y now
becomes Y’.) Graph the data points from (a).
c) Which form of equation should we use, for best fit?
d) Calculate the constant in this equation when X’=0.20 yields
Y’=0.40.
e) Insert this constant value into the general equation, and calculate the values of Y’ when X’=0.25, 0.5 and 0.75.
f) Mark these points on the graph and pass a smooth curve
through them.
g) To what extent does this curve agree with the data points?
Could some other smooth curve fit much better?
What is more basic: support or opposition?
We observed that fitting Y=Xk didn’t seem to work out. To fit the data,
we had to think in terms of opposition, instead of support. What does
this imply? Asking this question makes us discover that we implicitly
started out with the assumption that the supports of democracy and
capitalism are features that support each other. If so, then lack of
support is a leftover. Data suggest that, to the contrary, opposition to
democracy and capitalism may support each other. If so, then support
for democracy and capitalism would become a residual, a lack of
opposition. The question may have to be set up in terms of avoidance
rather than support.
However, it may well be that this lack of fit with Y=Xk was just a
fluke due to having too few cases. Further data points from around the
world might crowd more near China than around Philippines, making us
shift back to Y=Xk and the primacy of support. But whatever pattern
might emerge when more cases are added, the question would remain:
Why does k have the value it has? Maybe we just have to accept it as an
empirical fact, just like we accepted k=42 years in C=42 years/N2. But
maybe there are further logical reasons why k cannot be much smaller
or larger than roughly 2 or 1/2. We should keep this “why” in mind.
Logical formats and logical models
Do we truly have here a logical model? We made use of allowed regions,
conceptual anchor points and requirements of simplicity and avoidance of
absurdity, so we were advancing toward logical model building. But we did
so only from a formal viewpoint. We have an intelligent description of how
views on democracy and capitalism seem to be related – a description that
fits not only the data but also some logical constraints. What we still need to
ask is: Why?
Why is it that the relationship is what it seems to be? Why should it be
so? We are not scientists, if we shrug off this question, being satisfied with
describing what is. We may not find an answer, at least not right now. But it
is better to pose questions without answers rather than asking no questions.
Questions posed may eventually receive answers. Questions which are not
posed will never be answered.
At the moment, we certainly have a logical format for data fitting, in
contrast to the illogical linear format, which predicts absurdities for some
values of variables. This is an indispensable step toward a model that
answers more about the “why?” question. Whether to call Y=Xk a logical
format or model depends on how much we read into the term “model”.
* Exercise 9.4
The graph below is reproduced from Norris and Inglehart (2003: 27).
The Gender Equality scale runs from 0 to 100. Approval of Homosexuality scale runs from 0 to 10. The line shown visibly is the OLS
regression line, Gender Equality regressed on Approval of Homosexuality.
Make an exact copy of the graph. If you can’t use a copy machine, tape
this graph on a well-lighted window, tape blank paper on top of it, and
trace carefully all the points. Ignore country names, the distinction
between the Islamic and Western countries, and the few markedly
deviant countries. In other words, focus on the main data cloud.
Up to now, I have presented lots of hoops for you to jump through.
Now it’s time for you to set the hoops for yourself. Do with it all you
can, on the basis of what you have learned in this book. Work on the
graph and show it. Then comment on data, on data analysis previously
shown, and on your additions. (CAUTION: students tend merely to
comment, without showing their reworked graph. This is like presenting
the text of this chapter but omitting Figures 9.2 and 9.3.)
How can we know that data fits Y=Xk?
If we suspect that some data might follow the format Y=Xk, how can we
verify this guess? If it does, then logY=k logX. Graph logY against logX,
and see if we get a straight line. If we do, its slope is k. Caution is
needed when some values are close to 0.
Similarly, if we suspect that some data might follow the format (1-Y)=(1-X)k, graph log(1-Y) against log(1-X), and see if we get a straight
line. If we do, its slope is k.
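A rough computational check of such straightness (a hedged sketch of mine, assuming numpy): if Y=Xk really holds, the pointwise ratios logY/logX should all come out roughly equal, and their common value is k.

    import numpy as np

    X = np.array([0.2, 0.4, 0.6, 0.8])
    Y = np.array([0.03, 0.13, 0.33, 0.61])   # illustrative values lying near Y = X^2.19

    ratios = np.log10(Y) / np.log10(X)
    print(ratios)          # roughly constant, so Y = X^k is plausible
    print(ratios.mean())   # a rough estimate of k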
A box with three anchor points: Seats and votes
The form Y=Xk would be our first guess when we have only two anchor
points: 0,0 and 1,1. But there are situations where another logical anchor
point enters, at 0.5, 0.5. This is the case when graphing seat shares against
vote shares.
Figure 9.4. The simplest curve through three anchor points (0,0), (0.5,0.5)
and (1,1): Y/(1-Y)=[X/(1-X)]k. The curve shown has k≈2, meaning that slope
is 2 at X=0.5.
Certainly 0 votes should earn 0 seats, and all votes should earn all the seats.
In between, some electoral systems give a heavy bonus to the largest party.
If only two parties are running, vote shares 60-40 might lead to seat shares
78-22. But if both receive one-half of the votes, then we’d expect both of
them to earn one-half of the seats. The result is a “drawn-out S” shaped
pattern (Figure 9.4). When three conceptual anchor points impose themselves – (0,0), (0.5,0.5) and (1,1) – the simplest family of curves passing
through these anchor points is
Y/(1-Y) = [X/(1-X)]k.
This is the form symmetric in Y and X. For calculating Y from X, it can be
transformed:
Y = Xk/[Xk +(1-X)k].
Here parameter k can take any positive values. Data that fit such an equation
will be presented later (Chapter 21).
Can we straighten out such a drawn-out S into a straight line? We can.
Simply take the logarithms on both sides:
log[Y/(1-Y)] =klog[X/(1-X)].
If we suspect that some data might follow this format, graph log[Y/(1-Y)]
against log[X/(1-X)], and see if we get a straight line. If we do, its slope is k.
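Here is a minimal computational sketch of mine of this three-anchor-point format: for a chosen k, calculate Y from X, and confirm that the log-of-ratios transformation recovers the slope k.

    import math

    def seats_from_votes(X, k):
        # The simplest curve through (0,0), (0.5,0.5) and (1,1).
        return X ** k / (X ** k + (1 - X) ** k)

    k = 2
    for X in (0.4, 0.5, 0.6):
        print(X, round(seats_from_votes(X, k), 2))   # e.g. 0.60 votes give 0.69 seats

    # Straightening: away from X = 0.5, log[Y/(1-Y)] / log[X/(1-X)] equals k.
    X = 0.6
    Y = seats_from_votes(X, k)
    print(math.log10(Y / (1 - Y)) / math.log10(X / (1 - X)))   # 2.0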
10. Science Means Connections Among
Connections: Interlocking Relationships
•  Science aims not only at connections among facts but also connections among connections.
•  Interlocking equations are more powerful than isolated ones.
•  Parsimonious models with a single parameter, such as Y=Xk, pack information more compactly than two-parameter models such as the linear Y=a+bX. Consequently, they have more connective power.
•  Algebraically formulated models do not ask which came first, the chicken or the egg – they just go together.
•  The distinction between “independent” or “dependent” variables rarely matters – the chicken and the egg are interdependent. Either could be an input or an output, depending on the issue on hand.
Science is more than just learning facts. Science deals with making
connections among facts. Empirical connections are a start, but logically
supported connections are much more powerful. Thus, C=42 years/N2,
with a logically supported exponent value 2, is stronger than an empirical best fit. But science aims at even more than connections among
facts. Science aims at making connections among connections (as
was briefly pointed out in Chapter 4). These connections can be of
different types.
Interlocking equations
One example is the sequence T → n → s1 → N → C. From the number of
seats, T, we can infer the number of seat-winning parties: n=T1/2
(Chapter 1). From there, we can infer the relative seat share of the
largest party: s1=1/n1/2 (Chapter 4). Further logical steps, not presented
here, lead from s1 to the effective number of parties, N. From there C=42
years/N2 leads to cabinet duration. Many more than two quantities are
connected through a string of equations.
In such a string, any of the quantities can be treated as the input. For
instance, for a given largest seat share s1, one can calculate the resulting
N and C, as well as the number of assembly parties, n, from which such
a largest share is likely to result. In principle, any quantity can be used
as input, and everything else results. In practice, large random errors
may accumulate.
Such strings are rare in physics. Rather, the basic building block is a
three-factor relationship. In electricity, V=IR relates voltage V, current I,
and resistance R. This equation is equivalent to I=V/R and to R=V/I:
Any factor can be calculated from the two others. It’s like a triangle of
factors. Change the value of any one factor, and the two others must
adjust.
But we also have another triangle, P=IV, introducing electric power
P. The two triangles together lead to a third: P=I2R. These triangular
relationships have one side in common, meaning two common factors.
Further such triangles connect with them only by one corner, meaning
one common factor: Charge q=It, where t is time, and electric field
E=V/r, where r is distance. And on we go to force between two charges
q and q’: F=kqq’/r2, where k is a universal constant. This k is somewhat
analogous to k in C=42 years/N2 in that they are empirically determined.
The equation F=kqq’/r2 is a 4-cornered pyramid rather than a triangle,
as it connects 4 variables.
Never mind the details of this example. The point is that there are
connections among factors, and connections among such connections.
In principle, similar interlocking relationships are conceivable in
social sciences. We just have to be on the lookout. Remember the
invisible gorilla (Chapter 3): We do not see what we are not trained to
look for, even when it is glaring at us.
Connections between constant values in relationships of similar form
A very different example of connectedness is offered next. Suppose two
quantities, A and C, are both connected to a third one, P, through
equations of form Y=Xk. This means that we have A=Pk and C=Ph,
where k and h are different constants. What can we say about the
relationship between A and C?
Make it more concrete by returning to the example in Chapter 8,
which connected the supports for the president (P) and the assembly
(A). Suppose that support for the cabinet (C) also was investigated. In
Chapter 8, the relationship between the supports for president and
assembly roughly fits the curve A=P3.6. If we should find that the
relationship is C=P2.1 between supports for president and cabinet, then
what can we say about the relationship between supports for cabinet and
assembly (C and A)?
Take A=P3.6, which expresses A in terms of P. How can we transform it so as to express P in terms of A? Use the basic rule of algebra:
We can do anything, as long as we do it on both sides of the equation.
To get rid of exponent 3.6 for P, raise both sides of the equation to
exponent 1/3.6: A1/3.6=(P3.6)1/3.6. The exponents for P multiply out to
1. Thus P=A1/3.6. Now plug this value of P into C=P2.1. We get
C=(A1/3.6)2.1, hence C=A2.1/3.6=A0.58. Two conclusions: 1) this relationship, too, is of form Y=Xk; and 2) the constant is obtained by dividing
one of the previous constants by the other.
We can repeat this example in more abstract terms: If Y=Xk and
Z=Xh, then X=Y1/k and hence Z=Yh/k.
Y=Xk and Z=Xh ⇒ Z=Yh/k and Y=Zk/h.
CAUTION: Do not blindly plug numbers into the latter equations. Try
to understand where they come from, and consider the situation on
hand. You may have Y=Xk and Z=Yh, and then Z=Xkh.
An important result of this connection between connections emerges:
Once we know how Y and Z are connected to X, we may not need to
graph Z against Y so as to determine the connection between them – we
can calculate it.
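The calculation can be verified numerically in a few lines (a minimal sketch of mine, using the constants from the hypothetical example above):

    k, h = 3.6, 2.1
    exponent = h / k                       # about 0.58

    P = 0.7                                # any allowed presidential support level
    A = P ** k                             # A = P^3.6
    C_direct = P ** h                      # C = P^2.1
    C_via_A = A ** exponent                # C = A^(2.1/3.6)
    print(round(C_direct, 4), round(C_via_A, 4))   # identical, as the algebra promises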
Another conclusion is that all relationships of form Y=Xk are
comparable in terms of the value of k. The relationship among male and
female literacy, as both increase, tends to be around F=M0.5 (Exercise
8.2), or inversely, M=F2.0. For support for democracy (D) and for
capitalism (C), the best fit of form Y=Xk was C=D2.19 (Chapter 9). (For
the moment we overlook the limited degree of fit!) Of course, F and M
are utterly different beasts, compared to D and C. Yet their relationships
have some similarity: they not only are both of format Y=Xk, but k≈2 in
both cases. What does it tell us?
In both cases separately we should ask: What is the underlying
mechanism or process that leads to k around 2 rather than around 1.5 or
3? And jointly for the two cases we might ask: Are there some very
broad similarities between these 2 processes? I have no answer. Maybe
some present student will find it.
* Exercise 10.1
Lori Thorlakson (2007) defines three measures of centralization of
power in federations, all running from 0 to 1: Revenue centralization
(R); Expenditure centralization (E); Index of federal jurisdiction (J).
Their mean values in 6 federations, from the 1970s to 1990s, are as
follows. The averages of the three measures are also shown, as Index of
federal power (IFP).
Country         R     E     J     IFP
Austria        .73   .70   .95   .79
Australia      .73   .59   .87   .73
Germany        .65   .58   .96   .73
Switzerland    .58   .48   .92   .66
US             .60   .56   .80   .65
Canada         .48   .42   .85   .58
a) In what order are the countries listed? What are the advantages
and disadvantages of this order, compared to alphabetical
listing?
b) Graph E vs. R, along with what else is needed – think inside the
box!
c) On the same graph, enter the points J vs. R.
d) Establish the likely shape of the relationships, and calculate
approximately the relevant parameters. egy.zuzarb .ketzuz
e) Graph the corresponding trend curves. Comment on the degree
of fit to the data points.
f) From the previous parameters (NOT directly from the data!),
calculate the approximate relationship between E and J. hr.hat
g) Graph, superimposed, the resulting curve E vs. J and the data
points. Comment on the degree of fit.
h) What do you think are the purposes of this exercise?
Instead of calculating the various logarithms involved in this exercise,
we can use log-log paper and graph Y against X. Note however, that
regression of logY against logX becomes difficult when some values of
X or Y are very low. Indeed, a single zero value blows up the entire log-log attempt. In such cases extra help may be needed. Recall that
sometimes (1-Y)=(1-X)k yields a better fit than Y=Xk.
* Exercise 10.2
Use the data in previous exercise.
a) Graph E vs. R and J vs. R on log-log scale, along with equality
line. Given that all values are between 0.4 and 1, you need only
one period on log-log graph paper. If your paper has more
periods, try to magnify one period, so as to get more detail.
Note that the point (1,1) is at the top right corner – not at
bottom left, as we may be used to having it.
b) Draw in the best-fit straight lines that pass through (1,1). These
correspond to curves when graphing Y against X. CAUTION.
One of the anchor points vanishes to infinity. While the curves
Y=Xk converge at 0,0, the lines logy=k logx cannot converge,
because log0 tends toward minus infinity. To put it more bluntly:
the point (0,0) cannot be graphed on log-log paper; do not
confuse it with point (0.1,0.1), which has no special meaning.
c) Determine the slopes of these two lines, by measuring vertical and
horizontal distances, then dividing them. These are the values of k.
How do they compare with values of k observed in previous
exercise? CAUTION: Calculating the slope may be confusing.
Ignore the logarithmic grid. Use a regular centimeter or inch ruler.
Why linear fits lack connective power
When both variables have a floor and a ceiling, they can be normalized to X
and Y ranging from 0 to 1. If they are anchored at (0,0) and (1,1), then Y=Xk
or (1-Y)=(1-X)k often fits. Compared to linear fit Y=a+bX, the format Y=Xk
not only respects conceptual constraints but also makes do with a single
parameter (k) instead of two (a and b). This parsimony makes comparisons
of phenomena much easier and hence helps to answer “why?” It is also easy
to compare various data sets, ranking them by values of k, which express the
degrees of deviation from straight line.
Could such comparisons also be made using linear data fits? It is easy
to compare data sets on the basis of a single parameter k in Y=Xk, but it is
much messier to do so for two (a and b) in y=a+bx. Most serious, if the real
pattern is curved, then values of a and b can vary wildly for different parts
of the same data set. For example, if we calculate k in Y=Xk separately for
the lower and upper halves of the data in Figure 8.3, little will change. In
contrast, a and b in the linear fit will change beyond recognition. Comparisons with other data sets cannot be made on such a fleeting basis.
In this sense, the “simple” linear fit is not simple at all. It is well worthwhile to master the rather simple mathematics of the fixed exponent
equation rather than grind out linear parameter values that are meaningless
for comparison purposes. Even more broadly: It is easier to interconnect
multiplicative connections than linear ones.
Many variables are interdependent, not “independent” or
“dependent”
We often think of processes in terms of some factors entering as
“inputs” and producing some “outputs”. When graphing an output
variable and an input variable, the convention is to place the input on
the horizontal “x-axis” and the output on the vertical “y-axis” (cf.
Chapter 2). When writing equations, the convention is to put the output
first: C=k/N2, rather than k/N2=C. But what was an input in one context
may become an output in some other. Eggs lead to chickens, and
chickens lead to eggs.
It is difficult to propose a unique direction when considering support
for democracy and capitalism (Chapter 9). Dalton and Shin (2006)
decided to graph Capitalism vs. Democracy (cf. our Figures 9.1 to 3).
Did they imply that support for democracy comes first and somehow
causes support for capitalism? Not necessarily. It is more likely that
both are affected by a large number of other factors (X, Y, Z) and also
boost each other:
(X, Y, Z) → C
             ↕
(X, Y, Z) → D
The relationship of C and D may well be symmetric, but when
graphing, we are stuck with the choice between two asymmetric ways:
either we graph C vs. D or D vs. C. We have to do it one way or the
other, but don’t mistake it for a causal direction.
The same goes for equations. We have used Y for support for
capitalism and X for support of democracy, leading to Y=X2.19. But we
could as well transform it into X=Y0.405, where 0.405=1/2.19. We first have to find the value of k from some given values of X and Y, transforming Y=Xk into k=logY/logX. But later, each of the quantities k,
Y and X in Y=Xk may be the “output”, depending on the purpose of the
moment.
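A minimal Python sketch (the value X=0.70 is just a hypothetical illustration) shows that these are three readings of one and the same relationship:

    # Minimal sketch: the same relationship written three ways, using the
    # capitalism/democracy exponent from the text (k=2.19).
    import math

    k = 2.19
    X = 0.70                      # hypothetical support level for democracy
    Y = X ** k                    # support for capitalism from Y = X^k
    print(round(Y, 3))            # one direction of the relationship
    print(round(Y ** (1 / k), 3)) # X = Y^(1/k) recovers X = 0.70
    print(round(math.log(Y) / math.log(X), 3))  # k = logY/logX recovers 2.19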
Such transformations of the same equation – Y=Xk, X=Y1/k and
k=logY/logX – are inherent whenever we use algebraic equations: The
equality sign “=” is valid in all directions. Algebraically formulated
models do not ask which came first, the chicken or the egg – they just
go together. It will be seen in Chapter 16 that the OLS regression equations, in contrast, are one-directional and should be written as y ← a+b1x1+b2x2+…, rather than with “=”.
Some social scientists pay much attention to whether variables are
“independent” or “dependent”, but physical scientists are accustomed to
thinking in terms of interdependent variables. Causal direction may vary.
Sometimes the existing number of teachers affects literacy, and
sometimes existing literacy affects the number of teachers. Thus it
makes more sense to talk of input and output variables under the given
circumstances rather than inherently “independent” and “dependent”
ones. The terms “input” and “output variables” indicate their respective
roles in the moment's context, without passing judgment on some inherent causal direction.
Exercise 10.3
Exercise 8.1 dealt with people’s degree of trust in their country’s legal
system (L) and in their country’s representative assembly (A). Which of
L and A might be the independent and which the dependent variable?
Can we clearly decide which way it goes? Would we lose anything if we assumed that they are just mutually interdependent?
11. Volatility: A Partly Open Box

• The foremost mental roadblocks in logical model building have little to do with mathematical skills. They involve refusal to simplify and reluctance to play with extreme cases and use their means.
• “Ignorance-based” models extract the most out of near-complete ignorance. They ask: What do we already know about the situation, even before collecting any data? Focus on conceptual constraints.
• Sherlock Holmes principle: Eliminate the impossible, and a single possibility may remain – or very few.
• Eliminate the “conceptually forbidden regions” where data points could not possibly occur.
• Locate the conceptual “anchor points” and conceptual ceilings, where the value of x imposes a unique value of y.
• This done, few options may remain for how y can depend on x – unless we tell ourselves “It can’t be that simple”.
• Dare to make outrageous simplifications for an initial coarse model, including as few variables as possible. Leave refinements for later second approximations.
When building a logical model, we can start with applying logic and
look for data to test it. Conversely, we can start with intelligent data
analysis and see what type of model it is hinting at. In the last few
chapters, we began with data. Previously, in the ignorance-based
approach, we began with logic. Now we are mixing these approaches.
There will be some repetition. This is necessary for some steps to become almost automatic.
When noticing repetition, some people conclude: “I have already
seen that, so I can by-pass it.” Some others may observe: “I saw it
before, but it sounded either strange or pointless. Let me see if I can
now get more out of it.” Having seen and having grasped are quite
different stages.
The logical part of a coarse model for volatility
Volatility (V) stands for the percentage of voters who switch parties
from one election to the next. When more parties run, voters have more
choices for switching. So the number of parties (N) might increase
volatility. The directional model is “N up → V up”. Can we make it
more specific? Once again, you should try to answer this on your own,
until you get stuck, and then compare your approach to what follows.
How should we start? Graph V against N, so as to visualize the issue.
At first we can show only the two axes. Next, let us ask: What do we
already know about volatility and parties, even before collecting any
formal data?
The first step in constructing models often echoes the advice by the
fictional Sherlock Holmes: Eliminate the impossible, and a single
possibility may remain – or at least the field is narrowed down
appreciably. Apply it here. Mark off those regions where data points
could not possibly occur. These are the forbidden regions or areas.
Volatility cannot be less than 0 or more than 100 per cent. The number
of parties cannot be less than 1. These conceptually forbidden regions
are shown in Figure 11.1. The remaining allowed region has the form
of a box open to the right.
Figure 11.1. Individual-level volatility of votes vs. effective number of
electoral parties – conceptually forbidden regions (areas), anchor point,
and expected zone. (Note: The legend “Anchor point” at top left is confusing;
the actual anchor point is the triangle at lower left.)
[Graph for Figure 11.1: individual-level volatility (%) from 0 to 100 on the vertical axis vs. effective number of electoral parties from 0 to 10 on the horizontal axis. The conceptually forbidden areas (V below 0, V above 100, N below 1) are marked, along with the anchor point at (1,0), the line V=20(N-1) with the surprise zone above it, and the line V=10(N-1) running through the expected zone.]
Make your models as simple as possible – but no simpler
You may feel that we have simplified to an unreasonable degree. Reality is
much more complex. Indeed, the foremost mental roadblocks in logical
model building have little to do with mathematical skills. They have to do
with refusal to simplify and reluctance to play with extreme cases and use
their means. Recall the advice attributed to Albert Einstein: “Make your
models as simple as possible – but no simpler.” Omit everything that isn’t
essential. Do not omit anything that is essential. And know the difference.
How do we know the difference? If we oversimplify, we’ll soon know – our
model would not fit the data. At this point, we make at least three
simplifying assumptions.
First, we assume that at least one party obtains some votes in both
elections. This restriction excludes the unlikely situation where a single party
has all the votes in one election but loses them all to a brand new party in the
next election.
Second, we assume that the same voters vote at both elections. This
simplification is serious, because real elections always have some voters
dropping out and some others entering, from one election to the next. We should
not forget about it, but model-building best proceeds by stages. “As a first
approximation”, let us assume that negligibly few voters drop out or join.
Let us work out this simple situation first. If successful, we can go to a
second approximation, where we take into account more factors – not only
the shift in voters but also some other factors that might influence volatility,
besides the number of parties.
Third, we assume that the effective number of parties is an adequate
measure. Here it is calculated on the basis of vote shares, in contrast to previous use based on seat shares. This N may change from one election to the next. In that case, use their mean. Arithmetic or geometric mean? It hardly
matters when the two values of N are quite similar, as they usually are.
Next, note that there is a conceptual extreme case. Suppose that only
one party runs in both elections, so that N=1. Here switching to another
party is impossible. Hence volatility must be zero. This point (N=1,
V=0) is marked in Figure 11.1 with a triangular symbol. This is a conceptual anchor point. At N=1, even a slight deviation of V away from
zero would violate logic. Of course, democratic countries practically
always have more than one party running. Logical models, however,
must not predict absurdities even under extreme conditions.
If V increases with N, our simplest tentative assumption could be
linear increase: V=a+bN, where slope b must be positive. This could be
any upward sloping straight line. But the anchor point adds a constraint.
All acceptable lines must pass through the anchor point (1,0). How do
we build this constraint into V=a+bN?
For N=1, we must have V=0. Plug these values into V=a+bN, and
we get 0=a+b. This means that a=-b, so that
V = -b+bN = b(N-1).
Now, among the infinite number of upward sloping straight lines, only
those will do where the initial constant equals the negative of slope.
Without any input of data, the conceptual anchor point approach has
already narrowed down the range of possibilities. Instead of having to
look for two unknown constants (a and b), we have only one. This is
a tremendous simplification.
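A minimal Python sketch makes the constraint visible: once a=-b is built in, every slope b still yields V=0 at the anchor point N=1.

    # Minimal sketch: imposing the anchor point (N=1, V=0) on V = a + bN
    # leaves a single free parameter, since a = -b.
    def volatility(N, b):
        return b * (N - 1)        # V = b(N-1), i.e. a = -b is built in

    for b in (5, 10, 20):         # any slope still respects the anchor point
        print(b, volatility(1, b), volatility(4, b))   # V(1)=0 for every b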
Introducing an empirical note into the coarse model
Now we proceed onto shakier ground. The effective number of parties
rarely reaches N=6. You may share a gut feeling that even with 6 parties
to choose from, not all voters will switch. If so, then V=100 percent at
N=6 would be a highly surprising outcome, although it is not conceptually impossible. Here we introduce a touch of empirical knowledge into the previously purely logical framework. The particular line
of form V=b(N-1) which passes through the point (6,100) is shown in
Figure 11.1. How did we find the equation of this line?
Plug the values (6,100) into V=b(N-1). The result is 100=b(6-1);
hence b=100/(6-1)=20. Thus, the equation of this line is V=20(N-1).
Any data point located above this line would be highly surprising,
although we cannot completely exclude the possibility, in contrast to the
conceptually forbidden areas. Hence this zone is marked as a surprise
zone in Figure 11.1.
So V=20(N-1) is roughly the highest value of V that would not
seriously surprise us. Do we also have a lowest value? No – even with a
very high number of parties, it is still conceivable that party loyalty of
voters could be complete. Thus no limit higher than V=0 can be
proposed, meaning a horizontal line at V=0, which is the x-axis.
Without any real data input, we have now narrowed down the
reasonably expected zone where data points could occur. It’s the cone
between the lines V=20(N-1) and V=0. In the absence of any other
knowledge, we have no reason to expect the actual line to be closer to
either of these two extremes. Therefore, our best “minimax bet” would
be the average of the likely extremes. The arithmetic mean of V=20(N-1) and V=0 is V=10(N-1). We really should write it with a wavy equality sign “≈”, because it is quite approximate:
V10(N-1).
Still, without resorting to any data, we have gone beyond a directional
model to a quantitative one. It is based on near-complete ignorance and
is shown in Figure 11.1. This model makes two distinct predictions, one
of them firm, the other quite hesitant.
1) If any straight line fits at all, it absolutely must have the form
V=b(N-1), so as to respect the anchor point.
2) The slope b would be around 10, very approximately. In other
words: “If you force me to guess at a specific number, I would say
10.”
Exercise 11.1
Here we took the arithmetic mean of slopes 20 and 0, to get V ≈ 10(N-1),
despite my arguing in favor of the geometric mean in previous chapters.
What’s the justification?
Testing the model with data
Once constructed, such a model needs testing in two different ways:
Logical testing to guard against any absurd consequences; and testing
with actual data. Let us start with the last one.
Data from different countries are quite scattered, because many other
factors enter, besides N. But a uniform data set is available from Oliver
Heath (2005), for state-level elections in India, 1998–1999. Broad
conditions were the same for all the states, but many parties competed
in some states, while few did in some others. The mean values were:
N=3.65 and V=31.6. These values lead to b=31.6/(3.65-1)=11.9. Thus
our very coarse expectation of 10 was off by only 20 percent. For a
prediction not based on data, this is pretty good. So, at least for India,
V=11.9(N-1)=-11.9+11.9N.
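A minimal Python sketch retraces this arithmetic from the reported means:

    # Minimal sketch: the slope implied by the Indian means reported in the text.
    N_mean, V_mean = 3.65, 31.6
    b = V_mean / (N_mean - 1)
    print(round(b, 1))                 # 11.9
    print(round(b * (3 - 1), 1))       # predicted V for N=3, about 23.8 per cent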
Heath (2005) reports the best statistical fit as V=-9.07+11.14N [R2=0.50].
Figure 11.2 shows both lines and the data points. Neglect for the
moment the downward bending curve.
The “correlation coefficient” R2 indicates the goodness of fit to the best possible straight
line. (We’ll come to that in Chapter 15.) Perfect fit of data points to the line would yield
R2=1.00, while utter scatter would yield R2=0.00. So R2=0.50 reflects appreciable scatter
but still a clear trend.
Figure 11.2. Individual-level volatility of votes vs. effective number of
electoral parties: data and best linear fit from Heath (2005), plus coarse
and refined predictive models.
[Graph for Figure 11.2: individual-level volatility (%) from 0 to 100 vs. effective number of electoral parties from 0 to 10, with the conceptually forbidden areas marked, the anchor point at (1,0), the data points, the anchor-point line V=11.9(N-1), the unconstrained best-fit line V=-9.07+11.14N, and the curve V=100[1-e^(-0.143(N-1))] bending below the ceiling.]
Which equation should we prefer? They are very close to each other,
compared to the wide scatter of data points. They fit these data
practically equally well. Both represent statistical best-fit lines, but for
two different ways to fit data.
Equation V=-9.07+11.14N results from the assumption that any
straight line is acceptable, meaning that any values of a and b in
V=a+bN are just fine. For N=1, it yields V=2.07 rather than the
conceptually required V=0. On a scale 0 to 100, the difference is small –
but it is absurd nonetheless to claim that with a single party available,
2% of voters still manage to change their vote.
In contrast, equation V=-11.9+11.9N results from the assumption
that only the lines passing through the conceptual anchor point are
acceptable. This line is the best statistical fit subject to this logical
condition.
If we want to predict the results of future elections in Indian states, both equations are about equally good (or equally bad, given the scatter in previous data), but conceptually we are better off with the line that respects the
anchor point. This is even more so, if we want to guess at volatility in
elections elsewhere, because we are certain that the anchor point holds
universally. However, we should be prepared to find that the slope
might differ appreciably from 11.9 when it comes to countries with
political cultures different from India’s. So we might be cautious and
offer a universal quantitative prediction with a wide range of error –
maybe V=(12±3)(N-1).
Testing the model for logical consistency
So this takes care of testing the model with actual data. But we still need
logical testing, so as to guard against any absurd consequences. Look
again at Figure 11.2. What level of volatility does our model, V=
-11.9+11.9N, predict for N=10? It predicts more than 100 per cent! This
is absurd. We cannot plead that such large numbers of parties practically never materialize. A logical model must not predict absurdities even
under extreme conditions. If it does, it must be modified.
Indeed, in addition to the anchor point (1,0), we must satisfy another
extreme condition. There is a conceptual ceiling: When N becomes
very large, V may approach 100 but not surpass it. Mathematically:
When N→∞, then V→100 per cent.
The curve that bends off below the ceiling in Figure 11.2 corresponds to an “exponential” equation, V=100[1-e^(-0.145(N-1))]. This equation may look pretty complex. Yet it represents the simplest curve that satisfies both extreme conditions – anchor point and ceiling – and best fits the data. It is parsimonious in that it includes only one adjustable parameter, which here is 0.145. What does this equation stand for, and how was it obtained? This will be discussed later. We can say that V=-11.9+11.9N is a coarse model, a first approximation, and that V=100[1-e^(-0.145(N-1))] is a more refined model, a second approximation.
Even when we have a conceptually more refined model, we might
prefer to use the simpler one because it’s easier to work with. In the
usual range of N the simple model works as well as the more refined
one – this is visible in Figure 11.2. There is nothing wrong with such
simplification, as long as we do not forget its limitations. If ever we
should get many values of N larger than 6, we should consider using the
refined model.
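A minimal Python sketch, using the constants quoted in this chapter, shows how close the coarse and refined models stay in the usual range of N and where they part company:

    # Minimal sketch: coarse linear model vs. the refined exponential model
    # (using the constant 0.145 quoted in the text) for a few values of N.
    import math

    def coarse(N):   return -11.9 + 11.9 * N
    def refined(N):  return 100 * (1 - math.exp(-0.145 * (N - 1)))

    for N in (1, 2, 4, 6, 10):
        print(N, round(coarse(N), 1), round(refined(N), 1))
    # The two stay within roughly 10 points of each other up to N of about 6;
    # near N=10 the linear model exceeds 100 per cent, while the exponential
    # stays below the ceiling.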
Exercise 11.2
In Figure 11.2, the single data point above N=6 actually agrees with the
coarse linear model better than with the exponential one. Then how can
I say that, for N>6, we should predict on the basis of the exponential
curve, when the straight line fits better?
Exercise 11.3
Second and further approximations can take many directions. Here we
have specified the relationship between V and N. We could consider
other factors that might affect volatility. We could also consider
possibly better ways to measure both volatility and the number of
parties. What about people who vote only in the first or only in the
second of the two elections on which volatility figures are based? How
might these occasional voters affect our model? I have no answers to
offer. If you find any, it would be impressive – but don’t spend too
much time on this exercise.
Exercise 11.4
Our model involves a box open to the right – as if N could go all the
way to infinity. Student Valmar Valdna pointed out to me (2.9.09) that
the adult population (P) of the country would impose an upper limit.
The maximum N would correspond to each person forming a separate
party and voting for this party. Each person most likely would stick to
her/his very own party the next time around, and so volatility would
drop again to zero! This would be another constraint. Can you work it
into the model? Hint: Find a simple expression in N and P that would
equal 1 when N=1 but would drop to 0 when N=P; then multiply the
existing V by this expression.
[In more mathematical language: find a function f(N) such that f(1)=1
and f(P)=0. This is how we often translate constraints into models.]
Exercise 11.5
Instead of the exponential approach to the ceiling, we could use a fixed
exponent approach: V/100=[(N-1)/N]^k. Indeed, V=100[(N-1)/N]^4 yields the following sample values (a short computational check follows this exercise):

N:                   1     1.5    2     3     4     6     10
V=100[(N-1)/N]^4:    0     1.2    6.2   19.8  31.6  48.2  65.6
This curve may fit the data cloud slightly better than does the exponential. Graph it on top of Figure 11.2 and see how the two curves
differ. Why do we tend to prefer the exponential format? See Chapter
20.
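The sample values in the small table above can be reproduced with a minimal Python sketch:

    # Minimal sketch: recomputing the sample values for the fixed exponent
    # alternative V = 100[(N-1)/N]^4 used in Exercise 11.5.
    for N in (1, 1.5, 2, 3, 4, 6, 10):
        V = 100 * ((N - 1) / N) ** 4
        print(N, round(V, 1))
    # Expected output: 0, 1.2, 6.2, 19.8, 31.6, 48.2, 65.6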
12. How to Test Models: Logical Testing and Testing with Data

• Logical models must not predict absurdities even under extreme circumstances. Logical testing means checking the model for extreme or special situations.
• Most models apply only under special conditions, which should be stated.
• In testing models with data, eyeballing must precede statistical tests; otherwise, the wrong type of test might be used.
• Data often must be transformed prior to statistical testing, guided by logical constraints. Logarithms often enter.
• Study the graphs of data and model-predicted curves very carefully. They may supply much more information than meets the eye at first glance.
• We live in a world where multiplicative relationships overshadow the additive.
We have already touched on testing a logical model. It involves two aspects:
• Logical testing, to guard against any absurd consequences.
• Testing with actual data.
Testing with data is often thought to be the only test a model needs, but it alone does not suffice. Even though it fit the data, the coarse linear
model of volatility (Chapter 11) ran into trouble at a large number of
parties, predicting volatilities of more than 100%. It had to be modified
into an exponential model that avoids this absurdity. Let us review the
models previously offered, in this respect.
Logical testing
In Chapter 1, the model for the number of seat-winning parties (n)
when T seats are available is n=T1/2. Look for extreme cases. The lowest
possible value of T is 1. Does the model yield a reasonable result? Yes,
it predicts n=1, which is indeed the only logically acceptable possibility.
There is no conceptual upper limit on T, and no contradictions can be
seen even for very high values.
Of course, both T and n come in integer numbers. Yet for most
integer values of T, the formula yields a fractional n. This is no
problem, because we deal with average expectations. For T=14, we
calculate n=3.7. This means that 4 parties winning seats is somewhat
more likely than 3. It is also quite possible that 2 or 5 parties win seats –
it is just much less likely.
Does the model for the largest component (Chapter 4) fit even
under extreme conditions? The model is S1=T/n1/2. Suppose there is
only one component: n=1. The formula correctly yields S1=T.
Now suppose we have a federal assembly of 100 seats and the
number of federal units is also 100. If the term “federal unit” has any
meaning here, each unit would have one and only one seat, and this
applies to the largest unit as well. Yet the formula yields S1=100/100^(1/2)=10, leaving only 90 seats for the other 99 federal units! How
should we refine the model so as to avoid such an outcome?
When establishing the model, we started with T/n<S1<T. Actually,
we should specify that T/n<S1<T-(n-1)m, where m is the minimal
amount that must be left to each of the other n-1 components. If this
minimum is 1 seat, then T/n<S1<T-n+1, and the geometric mean of the extremes is S1=[(T-n+1)(T/n)]^(1/2). This is a pretty messy expression, but yes, plug in T=100 and n=100, and we do get S1=[(1)(1)]^(1/2)=1, as we
should.
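A minimal Python sketch compares the simple and refined versions at the problematic extreme and at a more typical case:

    # Minimal sketch: the refined model S1 = [(T-n+1)(T/n)]^(1/2) at the
    # extreme T=100 seats, n=100 units, and at a more typical case.
    def s1_simple(T, n):   return T / n ** 0.5
    def s1_refined(T, n):  return ((T - n + 1) * (T / n)) ** 0.5

    print(s1_refined(100, 100))            # 1.0 -- one seat for the largest unit
    print(round(s1_simple(100, 5), 1), round(s1_refined(100, 5), 1))
    # With T much larger than n, the two versions nearly coincide (44.7 vs. 43.8).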
Can we still use S1=T/n1/2? Yes, as long as the number of items
allocated (T) is much larger than the number of components (n)
multiplied by the minimal share (m): T>>nm. Thus we should specify
the simplified model as follows:
S1=T/n1/2
[T>>nm].
If this condition is not fulfilled, switch to more refined model. Most
models apply only under special conditions, which should be stated.
The relationship is analogous for the exponential and linear models
for volatility. One is more refined, yet the other is so much easier to use
and most often works with sufficient precision. We just have to know
when we can approximate to what degree. Note that Exercise 11.4
introduces a further limit.
Consider next the model for cabinet duration (Chapter 6). We
already noted that the basic model for the number of communication
channels, c=n(n-1)/2, works also for the extreme cases of n=0 and n=1.
But what about the model for cabinet duration, C=k/N2? The lowest
limit on the number of parties is N=1. Then C=k. The best fitting value
of k has been found to be 42 years. If a one-party democracy came to
exist, the model predicts a cabinet duration of about 42 years. For a pure
two-party system (N=2), the model predicts 10 years. Does this feel
right? Here it is not a matter of clear conceptual limits but having
surprise zones (as in Figure 11.1 for volatility). Consider the following
limits.
1) By the loosest definition, a cabinet is considered to continue as
long as it consists of the same party or parties. The ministers or even the
prime minister could change. Even so, we might be surprised if the
same cabinet continued beyond human life spans, say 80 years.
2) At the low side, consider a two-party system. One party is bound
to have a slight majority and form the cabinet, which is likely to last
until the next elections, short of a rather unusual rift within the ruling
party. Elections typically take place every 4 years. Hence average
durations of less than 4 years for two-party constellations would
surprise us. Plug C=4 years and N=2 into C=k/N2, and out pops k=
CN2=16 years.
In sum, values of k outside the range 16 years<k<80 years might
surprise us. The geometric mean of the non-surprise zone is 36 years.
The actual 42 years is close to 36 years, but this is sheer luck. After all,
the limits 16 and 80 years are not conceptual anchor points but quite
hazy limits of vague zones of surprise. The main thing is that 42 years is
in that zone.
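A minimal Python sketch of this back-of-the-envelope estimate:

    # Minimal sketch: the surprise limits on k in C = k/N^2 and their geometric mean.
    k_low  = 4 * 2 ** 2        # C=4 years at N=2 implies k = C*N^2 = 16 years
    k_high = 80                # cabinets outlasting a human life span would surprise
    k_mid  = (k_low * k_high) ** 0.5
    print(k_low, k_high, round(k_mid, 1))   # 16 80 35.8 -- close to the observed 42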
Does C=k/N2 make conceptual sense at the other extreme, a very large number of parties? As N→∞ (N tends to infinity), C=k/N2→0. Yes,
we can imagine cabinet durations becoming ever shorter as the number
of parties increases. Of course, just as one-party democracies are hard to
imagine (although Botswana comes close, with effective number of
parties N=1.35), very short durations are hard to imagine in practice.
But models must not predict absurdities even under unreal circumstances.
The previous models applied to specific problems. In contrast, the fixed
exponent format Y=Xk applies to a broad spectrum of phenomena – all
those which can be normalized to fit into a square box, with anchor
points at two opposite corners – and no other conceptual anchor points.
However, one can start from either corner, so that (1-Y)=(1-X)k is
equally likely. If the anchor points are at top left and bottom right –
(0,1) and (1,0) – then the forms Y=(1-X)k and 1-Y=Xk apply. Many other
mathematical forms can join the opposite corners, but they are more
complex. They would have to be tried when the data refuse to follow
the simplest form or when there are logical reasons to expect a different
format.
Testing with data
Models may be logically consistent, yet fail to agree with reality. We
may deduce logically that the pattern C=k/N2 should prevail, but we
may discover that the actual pattern is closer to C=k/N3 or to C=k/N1.5 –
or to no equation of the form C=k/Nn. How do we discover what the
actual pattern is?
One way is to graph all the raw data we have and compare to what
the model predicts. This is what we did in Exercise 2.1, for a few
countries. We may graph the curves C=k/N2 for selected values of k
such as k=30, 40 and 50 years and see if the clouds of data points fit
along one of them. If they do, the model fits. By trial and error, we
might find the best-fitting value of k, but this is a slow process. If, on the
contrary, the data cloud crosses the curves C=k/N2, then the model does
not fit, and we have to find a family of curves to which the cloud does
fit.
Curves are messy. It is so much easier to work with straight lines.
With the latter, the eyeball test often works. At a glance, we can see
whether the data points follow a straight line and what its slope is.
Fortunately, whenever a model involves only multiplication, division
and fixed exponents, taking logarithms turns it linear, as seen below.
Many models become linear when logarithms are taken
Most of the models discussed up to now follow the generic form y=cx^k (Table 12.1). This means the input variable is raised to some simple exponent (1, ½, 2, -1, -½, or -2) and multiplied by some constant (which could be 1). No additions or subtractions! Then the logarithms of input and output have a linear relationship: logy=logc+klogx.
Why do relationships tend to have such a form? We live in a world
where multiplicative relationships overshadow the additive. This is so
because we can multiply and divide quantities of a different nature:
Divide distance by time, and we get velocity. In contrast, we can add or
subtract only quantities of the same nature: distance plus distance, or
time minus time. Multiplicative relationships lead to curves rather than
straight lines.
Conversion to logarithms turns curved patterns into linear. Thus, it is
most useful to graph them on log-log paper. The coarse model for
volatility is different; testing it does not need logarithms. However,
logarithms enter the refined volatility model in a different way, to which
we’ll come later.
Table 12.1. Many logical models have the form y=cx^k.

Generic form:                  y=cx^k        →  logy=logc+klogx      constant: c   exponent: k
No. of seat-winning parties:   n=T^(1/2)     →  logn=0.5logT         constant: 1   exponent: ½
The largest share:             S1=T/n^(1/2)  →  logS1=logT-0.5logn   constant: T   exponent: -½
Cabinet duration:              C=k/N^2       →  logC=logk-2logN      constant: k   exponent: -2
Fixed exponent:                Y=X^k         →  logY=klogX           constant: 1   exponent: k
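Here is a minimal Python sketch of the testing procedure, using a handful of invented data points that roughly follow C=42/N^2 (illustrative only – a real test would use actual cabinet duration data):

    # Minimal sketch (invented data): testing the form C = k/N^2 by regressing
    # logC on logN and reading off the slope and the intercept.
    import math

    data = [(1.9, 11.5), (2.3, 8.0), (3.0, 4.7), (3.9, 2.8), (5.1, 1.6)]  # (N, C) pairs
    x = [math.log10(N) for N, C in data]
    y = [math.log10(C) for N, C in data]

    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    print(round(slope, 2), round(10 ** intercept, 1))
    # A slope near -2 supports C = k/N^2; 10^intercept estimates k in years.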
What can we see in this graph?
Figure 12.1 shows the result for cabinet duration in 35 democracies
(data from Lijphart 1999) graphed against the effective number of
parties, both on logarithmic scales. (Compare it to the curved pattern
you drew in Exercise 2.1 and the straightened patterns in Exercises 2.3
and 7.2.) The meaning of lines in this graph:
Thin solid line: best fit between logarithms.
Bold solid line: theoretically based prediction [C=42 years/N2].
Dashed lines: one-half and double the expected value.
What can we see in this graph? Stop reading and look
at the graph. List all the details you can see. What can
you conclude directly or by comparing the details you
see?
Students trained to deal only with directional models often see only that
“when N increases, C decreases.” This is akin to seeing only grass and
sky in a painting that also has humans and bushes and buildings and
much more.
Being able to read graphs to a full extent is a major part of basic
numeracy in social sciences. So this section is one of the most
important in this book. My own list is given at the end of the chapter.
Do not peek before making your own list! Only this way will you learn to see more than you have up to now.
Figure 12.1. Mean cabinet duration vs. effective number of legislative
parties – Predictive model and regression line. Source: Taagepera and
Sikk (2007).
The tennis match between data and models
Testing the model with data might look like the time-honored simple
recipe: “hypothesis (model) → data collection → testing → acceptance/rejection”.
However, this would oversimplify the process. It would be akin to
reducing a tennis match to “serve  respond  hit inside or outside 
win/lose”. Yes, this is a basic component of tennis, but much more is
involved before the “win/lose” outcome is reached. The basic “serve 
respond” component itself plays out many times over. So is it with
model testing, too.
Superficial data inspire the first coarse logical model. The model
may suggest looking for different data that better correspond to what the
model is about. But the same discrepancies between model and data
may also motivate search for a more refined model. Some hidden
assumptions may have entered the first round of model building; they
must be explicitly stipulated. For instance, the coarse model for
volatility implicitly assumed that any positive values of volatility are
acceptable, even those surpassing 100 percent. In sum, the process of
scientific research most often looks like an ascending spiral:
Initial hunch (qualitative hypothesis) → limited data collection → quick testing → quantitatively predictive model (quantitative hypothesis) → further data collection → testing → refined model → testing → further refining of model or data → testing ...
The simple recipe above (hypothesis → data collection → testing) represents a single cycle within this spiral.
The essential part of a predictive model is the predicted functional
form of relationship among the variables – “product of n and n-1
divided by a constant” for c=n(n-1)/2; “constant divided by N squared”
for C=42/N2. The model may include a constant or parameter (like k for
cabinet duration), which must be determined empirically.
Due to conceptual constraints, predictive models rarely are linear.
Linear approximations are useful in preliminary work, along with
graphical representations, to get a feel for the empirical pattern. They
are also useful at the very end, as practical simplifications. In order to
know when a simplification can be used, one must be aware of the
refined model.
Why would the simplest forms prevail?
This book has highlighted the models based on ignorance – or rather near-ignorance, teasing the most out of what we know about constraints.
Conceptually forbidden zones, anchor points, and continuity in-between
those points are important parts of our knowledge. Asking what would
happen under extreme conditions can lead to insights, even when we agree
that such extremes will never materialize. When the impossible is
eliminated, the possible emerges with more clarity.
But why should we expect the simplest mathematical formats to apply,
among the many formats that also satisfy some obvious constraints?
Addressing a similar issue, physicist Eugene Wigner (1960) observed that
the physicist is a somewhat irresponsible character. If the relationship
between two variables is close to some well-known mathematical function,
the physicist jumps to the conclusion that this is it – simply because he does
not know any better options. Yet, it is eerie how often this irresponsible
approach works out. It is as if mathematics were indeed the language in
which nature speaks to us.
What can we see in this graph? A sample list
At least the following can be seen in Figure 12.1. Let us start from the
broad frame and then work toward the center – first the most prominent
features and then the local details.
• Mean cabinet duration, C, is graphed against the number of parties, N.
• Both variables are graphed on logarithmic scales. (This is harder to notice for N – the unmarked notches should be labeled 2, 3, 5. If this were a regular scale, one would start from 0, not 1.)
• The data cloud visibly follows a roughly linear pattern on this log-log scale.
• The graph shows in bold the line that corresponds to k=42 years for C=k/N2 and in dotted lines the lines that correspond to k=84 and 21 years, respectively – i.e., double and one-half of 42. It can be seen that all data points but one fall into the zone between the dotted lines, and most points crowd along the central line.
• The exponent 2 in the denominator of C=k/N2 corresponds to slope -2 in logC=logk-2logN. Regardless of the value of k, all lines that fit the model have this slope. The value of logk just raises or lowers the line.
In sum,
• The data cloud does follow a straight line on the log-log graph.
• This line does have a slope close to -2.
• This line corresponds to a value of k around 42 years.
• Nearly all data points are located in a zone along the line that corresponds to an error of a factor of 2 for C.
The “parameter” k is not part of logical prediction – it is determined
precisely from this graph. When we put N=1 into C=k/N2, it becomes
C=k. This means that we can read off the values of k for the various
lines on the vertical scale at N=1, at the left side of the graph.
Recall our estimate that values of k outside the range 16 years<k<80
years might surprise us. Well, this is pretty much the same zone as the
zone between the dotted lines in the graph (21 years<k<84 years). Only
Mauritius (MRT) surprises us. Its cabinets seem to be much too short-lived, given its rather moderate number of parties. Was it unstable?
Here we can only ask the question; the graph does not supply an
answer. Whenever we have deviating cases, we should look for reasons.
Sometimes we find them, sometimes not. (Mauritius was not unstable.
To the contrary, it had an extremely long-lasting prime minister. He
cleverly played the parties against each other, including and dropping
some all the while, which technically produced different cabinets.)
Still more is to be seen in the graph.
• The line labeled C=31.3/N^1.757 is the best-fit line determined by statistical means. Never mind for the moment how it is determined. Note the following: It is visually very close to the line with slope -2, but its own slope is -1.757. It would reach the left axis (N=1, where logN=0) at the height 31.3 years, appreciably below 42 years. We can see that just a small drop in slope (from 2 to about 1.75) can change the intercept (the C value at N=1) quite a lot. But N=1 is an area with no data points, which makes the intercept less informative than the slope.
• Finally, consider the values of R-square. As stated previously, R2=1 expresses a perfect fit and R2=0 perfect scatter. Never mind for the moment how it is measured. Just observe that the fit for the best-fitting line with any slope is quite good (R2=0.79). However, the fit for the best-fitting line with the predicted slope (-2) is almost as good (R2=0.77). In view of the existence of a logical model and its empirical confirmation, we have here a law, in the scientific sense of the term – the inverse square law of cabinet duration, relative to the number of parties.
Did you see most of this in the graph? Did you draw most of these
conclusions? If you did not, don’t blame yourself. It’s a learning process. The main message is: Study the graphs of data and model-predicted curves very carefully. They may supply much more information than meets the eye at first glance.
Testing with data follows much the same pattern in the case of the
number of seat-winning parties and the largest share. Each presents
different snags (like the issue of 100 seats for 100 subunits, for the
largest share). The example of cabinet duration suffices for the moment.
The main message is: In testing models with data, eyeballing on
graphs must precede statistical tests; otherwise the wrong type of
statistical approach might be adopted.
13. Getting a Feel for Exponentials and Logarithms

• We have 10^a·10^b=10^(a+b), because the exponents simply add the numbers of zeros. Also 10^a/10^b=10^(a-b), which subtracts the number of zeros, and (10^a)^b=10^(ab), which multiplies the number of zeros. It results that 10^(-a)=1/10^a and 10^0=1.
• Decimal logarithms are fractional exponents of 10 that lead to the given number: logA is a number such that 10^logA=A.
• It follows that when numbers are multiplied, their logarithms add: AB=C ↔ logA+logB=logC. Also y=A^m ↔ logy=mlogA.
• Keep in mind the following markers: log10=1, log1=0, and log0 → minus infinity.
• The formulas established for exponents of 10 apply to exponents of any other number n too: n^a·n^b=n^(a+b), n^a/n^b=n^(a-b), (n^a)^b=n^(ab), n^(-a)=1/n^a, n^0=1, and the b-th root of n is n^(1/b).
• The natural logarithm (lnx) is just a multiple of logx: lnx=2.3026 logx. Conversely, logx=0.434 lnx.
By now you should be persuaded that there is no escape: One cannot do
basic logical models without logarithms and their counterparts, the
exponentials. If you know how to deal with them, you can bypass this
chapter. But make sure you really understand them, rather than just
applying rules you have memorized. We have introduced logarithms
very gradually – only to the extent they were indispensable for the
problem on hand. This approach also gave time for some basic notions
to sink in, before being flooded by more.
The somewhat different approach in this chapter still tries to keep it
as simple as possible. Previous gradual introduction may have reduced
anxiety. I do not want you to memorize “When numbers are multiplied,
their logarithms add” (although it’s true and useful) without internalizing where such a claim comes from. Only then are you prepared to
use them without hesitation, knowing what they mean. Then, if you
have made a mistake in calculations (as I often do), you’d be able to
smell it out and correct it. This is the hallmark of a professional.
Exponents
We use 10^3 as shorthand for 10×10×10. Thus, 10^3=1,000. More generally, the “exponent” a in 10^a is the number of zeros that come after “1”. It follows that 10^1=10. Also

10^0=1,

given that here “1” is followed by no zeroes.

If we multiply 100 by 1,000, we get 100,000. Using exponent notation, we have 10^2×10^3=10^5, which is 10^(2+3). The numbers of zeros add. This is how multiplication turns into addition. When multiples of 10 are multiplied, their exponents add:

10^a·10^b=10^(a+b).

When multiplying 100 by itself 3 times, we get 100×100×100=1,000,000. In exponent notation, 10^2×10^2×10^2=(10^2)^3=10^6. Thus

(10^a)^b=10^(ab).

If we divide 10,000 by 10, we get 1,000. Using exponents: 10^4/10^1=10^3, which is 10^(4-1). Hence division of numbers leads to subtraction of exponents:

10^a/10^b=10^(a-b).

Now consider the reverse division: 10/10,000 yields 1/1,000=0.001. The previous rule makes it correspond to 10^1/10^4=10^(1-4)=10^(-3). Note that the “3” corresponds to the number of zeros that precede “1”. We must also conclude that

10^(-a)=1/10^a,

because multiplying both sides by 10^a we get 10^a·10^(-a)=10^(a-a)=10^0=1 on the left, and also 10^a·(1/10^a)=1 on the right.
For future use, all the equations in bold may be worth memorizing –
but only if you understand what’s behind them. This means you can
prove these relationships, if challenged. Otherwise, memorization does
you little good. Conversely, if you understand, all this may look so
natural that no memorization is needed.
Exercise 13.1
OK, now quickly, without looking at the above: Why must 10^0 be equal to 1? Why must 10^(-a) be equal to 1/10^a? Why must 10^3 times 10^4 be equal to 10^7?
If you cannot respond, in terms of numbers of zeros, return to the beginning of the
section.
The formulas established for exponents of 10 apply to any other number
n too:
n^a·n^b=n^(a+b)
n^a/n^b=n^(a-b)
(n^a)^b=n^(ab)
n^(-a)=1/n^a
n^0=1.
It follows that the b-th root of n is n^(1/b). Also the b-th root of n^a is n^(a/b).
These relationships frequently enter model building and testing.
Fractional exponents of 10
The next question may sound crazy: What could 10^(1/2) or 10^0.5 stand for? If you take the previous rule seriously, it would mean “1” followed by one-half of a zero! It seems to make no sense. But hold it! Also consider the previous rule 10^a·10^b=10^(a+b). When multiplying 10^0.5 by itself, we would get 10^0.5·10^0.5=10^(0.5+0.5)=10^1=10. But this is the very definition of the square root of 10, which is approximately 3.16, given that 3.16×3.16=10. Thus 10^(1/2)=10^0.5 stands for the square root of 10 – it cannot logically stand for anything else! Yes, it’s as if 3.16 stood for “1” followed by one-half of a zero...

Now consider the cube root of 10, which is 2.154, because, as you can check, 2.154^3=10. We could then say that 2.154 is somehow like “1” followed by one third of a zero, because 10^(1/3)×10^(1/3)×10^(1/3)=10^(1/3+1/3+1/3)=10^1, which is “1” followed by a full zero.

What about exactly 3? It is somewhat less than 3.16 but much more than 2.154. So it should be “1” followed by somewhat less than one-half of a zero but much more than one third of a zero.

By now you may get the message: We can assign an exponent of 10, a sort of “fractional number of zeros”, to any number between 1 and 10. For instance, 2 is 10 with exponent 0.30. How can we prove it? Note that 2^10=1,024. This is quite close to 1,000=10^3. Thus 2^10≈10^3. Take the 10th root on both sides: (2^10)^(1/10)≈(10^3)^(1/10). Multiply the exponents through, and we get 2≈10^0.30.
Decimal logarithms
This “fractional number of zeros to follow 1” – this is what the decimal logarithm is. Thus, log3.16=0.500, log2.154=0.333, and log2=0.30. Hence, by definition, 10^log2=2. More generally, for any number A, logA is a number such that

10^logA=A.

When numbers are multiplied, their logarithms add. Indeed, consider AB=C. We can write it as 10^logA·10^logB=10^logC. It follows from 10^a·10^b=10^(a+b) that logA+logB=logC.

AB=C ↔ logA+logB=logC.

Also log(A^2)=log(A·A)=logA+logA=2logA. More generally, log(A^m)=mlogA. Note that m enters here, not logm. In other words,

y=A^m ↔ logy=mlogA.

What could be the logarithm of 0? Recall that log0.001=log(1/1000)=-3. Each time we divide a number by 10 we subtract “1” from its logarithm. How many times would we have to divide 1 by 10 so as to obtain 0? We’d have to do it an infinite number of times. Thus log0 tends toward minus infinity:

log0 → -∞.
Hence 0 cannot be placed on a logarithmic scale. What about
logarithms of negative numbers? Let us say that they do not have any.
What are logarithms good for?
They turn multiplications into additions – but who needs to go through logarithms when one can just multiply? However, they also turn exponents into multiplications, and this is where one cannot do without
logarithms. And all too many logical models involve exponents.
Example. Take the expression Y=X^3.6 in previous Table 8.2. How did I get Y=0.037 when X=0.40? For X=0.40, Y=0.40^3.6. Here we cannot go ahead without applying logarithms: logY=3.6log0.40. A pocket calculator with a LOG (or LOG/10^x or log) key comes in handy. On a
usual pocket calculator, enter 0.40, push LOG and get -0.398. (On some
calculators, you must first push LOG, then 0.40 and then “=”.)
Multiplying by 3.6 yields -1.4326.
Now take the “antilog” of -1.4326, which means taking 10^(-1.4326). On most pocket calculators, once you have -1.4326 entered, push “2nd function” and “LOG”, and you get y=10^(-1.4326)=0.0369≈0.037.
Shortcut. Many pocket calculators offer a shortcut for Y=0.40^3.6 – the “y^x” key. Enter 0.4, push “y^x”, enter 3.6, push “=”, and we get 0.037 directly. Does such a calculator by-pass taking logarithms? No. It just automatically takes log0.40, multiplies by 3.6, and takes the antilog.
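A minimal Python sketch retraces these calculator steps:

    # Minimal sketch: the calculator steps for Y = 0.40^3.6, done via logarithms.
    import math

    log_x = math.log10(0.40)          # about -0.398
    log_y = 3.6 * log_x               # logY = 3.6*logX, about -1.433
    Y = 10 ** log_y                   # antilog
    print(round(log_x, 3), round(log_y, 3), round(Y, 3))   # -0.398 -1.433 0.037
    print(round(0.40 ** 3.6, 3))      # the "y^x key" shortcut gives the same 0.037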
Logarithms on other bases than 10
Logarithms can be established on bases other than 10. This may make it
confusing. The only other type of logarithm needed for most models is the “natural” one, designated as “ln”. It is based on the number
e=2.718… instead of 10. By definition, lne=1, just as log10=1. What’s
so “natural” about 2.718…? We’ll come to that. When logx means
logarithm to the base 10, then lnx is simply logx multiplied by 2.3. More
precisely:
lnx = 2.3026 logx.
Conversely,
logx = 0.434 lnx.
Note that ln10=2.3026.
The previously established relationships still apply. In particular,
AB=C ↔ lnA+lnB=lnC,
and
ln(A^m)=mlnA.
Many pocket calculators have separate keys for LOG (and 10^x for antilog) and LN (and e^x for antilog). We’ll address the natural logs when the need arises.
14. When to Fit with What

• The conceptually forbidden areas inevitably constrain relationships between two variables. Conceptual anchor points and ceilings add further constraints.
• When the entire field y vs. x is open, try linear fit. Graph y vs. x and see if the pattern is straight.
• When only one quadrant is allowed, try fixed exponent fit. Graph logy vs. logx and see if the pattern is straight.
• When two quadrants are allowed, try exponential fit. Graph logy vs. x and see if the pattern is straight.
• When the resulting pattern is not linear, look for further constraints – like those for simple logistic growth and boxes with 3 anchor points.
We like to have linear fits. Sometimes we get a linear fit between x and
y. Sometimes, instead, we get a linear fit between logx and logy. We
have seen cases like that. But sometimes we get a linear fit between
logy and x itself (not its logarithm), as we’ll soon see. And sometimes
it’s even more complex. How does one know which way to go? Our
basic observation is that
the conceptually forbidden areas inevitably constrain
relationships between two variables. Conceptual
anchor points and ceilings add further constraints.
These constraints make some forms of relationship impossible, while
imposing some other forms. We should start with the simplest mathematical format that satisfies such constraints. We should add more
complexity only when data do not fit the simplest format. Such lack of
fit usually means that further logical constraints have been overlooked.
It is amazing, though, how often nature (including social nature)
conforms to the simplest forms. (Recall “Why would the simplest forms
prevail?” in Chapter 12).
Suppose that we have two data points, (x1,y1) and (x2,y2). What is the
simplest format to connect them, without leading to logical inconsistencies? Such an inconsistency would result, if such a format
predicts an impossible value of y for a possible value of x, or vice versa.
Unbounded field – try linear fit
About the only situation where a linear model is justified is when both x
and y can conceivably take any values – from minus infinity to plus
infinity. The allowed area is an “unbounded field” (Figure 14.1). Apart
from time and space, such quantities are rather rare. Their zero point
tends to be arbitrary, and hence there are no logical anchor points.
Figure 14.1. When to use linear regression on unmodified data.
[Diagram for Figure 14.1: an unbounded field (any x, any y; zero point often arbitrary, axes running from -∞ to +∞). Try fitting with y=a+bx (linear pattern). Two sample lines are sketched, one with slope b=+1 and one with b=-1/2, each marked with its intercept a and its x-axis crossing c.]
If the field is unbounded, no transformation of data is needed. Just
graph y vs. x. If the data cloud is linear, then y=a+bx applies. Then we
can draw in the visual best-fit line, y vs. x, or use a more formal
statistical method. How can we find the coefficients a and b in y=a+bx?
• Intercept a is the value of y where the line crosses the y axis (because here x=0).
• Slope b is the ratio -a/c, c being the value of x where the line crosses the x-axis (because here y=0).
However, we can also find the coefficient values in y=a+bx from any two suitable points:
• Take two points, far away from each other: (x1,y1) and (x2,y2). These should be “typical” points in the sense of being located along the axis of the data cloud, not high or low compared to most neighboring points.
• For y=a+bx we have b=(y1-y2)/(x1-x2). Then a=y1-bx1.
• When a=0 is imposed on logical grounds, so that the line is forced to go through (0,0), the equation is reduced to y=bx. Then b=y1/x1.
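A minimal Python sketch of the two-point recipe, with hypothetical points:

    # Minimal sketch: the two-point recipe for y = a + bx (hypothetical points).
    x1, y1 = 2.0, 7.0
    x2, y2 = 8.0, 19.0
    b = (y1 - y2) / (x1 - x2)     # slope from two typical points
    a = y1 - b * x1               # intercept follows
    print(a, b)                   # 3.0 and 2.0, so y = 3 + 2x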
One quadrant allowed – try fixed exponent fit
Most often, however, we deal with quantities that cannot go negative:
populations, votes, and parties. Then, only one quadrant of the open
field is allowed in a graph y vs. x (Figure 14.2). Moreover, there is a
natural zero that cannot be shifted: zero persons, zero parties, etc. E.g.,
for a country with zero square kilometers, it is reasonable to expect zero
population and zero parties. Here we should consider the fixed exponent
pattern y=Axk. When k is positive, it has an anchor point at (0,0). This
format does not lead to absurdities. In contrast, many straight lines do,
predicting a negative value of y for some positive values of x.
For the straight line, we calculated the parameters a and b that
describe a particular line. How do we calculate the parameters A and k
for a fixed exponent curve? This will be shown toward the end of the
chapter.
Figure 14.2. When to try the fixed exponent format.
Two quadrants allowed – try exponential fit
An intermediary situation arises when one of the quantities can range
from minus infinity to plus infinity (such as time), while the other
cannot go negative (such as population). In a y vs. x graph, two
quadrants of the open field are now allowed (Figure 14.3). There is a
natural floor at zero persons, but no natural zero for time, and hence no
anchor point. Here we should consider the exponential pattern, y=A(B^x). This can also be expressed as y=A(e^kx), where e=2.71… is the basis of natural logarithms. We have not yet discussed the exponential pattern,
but it is very important in life and in sciences. Bank deposits at a fixed
interest grow exponentially. All young biological beings initially grow
exponentially; so do some social or political organizations.
The fixed exponent equation y=A(x^k) and the exponential equation y=A(B^x) may look confusingly similar. The first has x raised to a fixed exponent (power), while the second has a constant raised to the exponent x. The difference in outcomes is huge.
Figure 14.3. When to try the exponential format.
How to turn curves into straight lines
Humans are pretty good at telling whether a line is straight, while
various curves may look all the same to us. Suppose the conceptually
allowed area suggests a fixed exponent or exponential relationship. It
helps if we can transform the data in such a way that they would form a
straight line – if the relationship is truly a fixed exponent or exponential.
Then we could graph the transformed data and see at a glance whether
the transformed data cloud looks straight. When this is the case, linear
regression of the transformed data is justified. If the transformed data
cloud still looks bent or otherwise odd, we would have to ponder why
this is so and what we have to add to the model so as to straighten out
the data cloud.
When the transformed graph does show a linear relationship,
y=a+bx, we should calculate or estimate the values of a and b. From
these we can determine the parameters of the original model – A and k
in y=Axk, and similarly for exponentials. It can be done without push-button regression, and in fact, this hands-on approach is sometimes
preferable. My experience is that students cannot understand and
interpret computer-generated regression outputs unless they have
acquired the ability to do rough graphs by hand and calculate the
parameters using nothing more than a pocket calculator. The description
of how to proceed is presented next.
Calculating the parameters of fixed exponent equation in a single
quadrant
If we expect fixed exponent pattern y=Axk, because only one quadrant is
allowed, taking logarithms leads to linear relationship between logy and
logx: log y=log A+klogx. Designating logA as a takes us to the familiar
linear form (logy)=a+k(logx). Hence we should graph logy vs. logx. If
the transformed data cloud is linear, then y=Axk applies. Then we can
regress logy vs. logx. How can we find the coefficients A and k in
y=Axk? We can do it in two ways.
Finding the coefficient values in y=Ax^k from special points on the log-log graph:
• Coefficient A is the value of y where the line crosses the logy axis (because here logx=0 and x=1).
• Exponent k is the ratio -logA/c, c being the value of logx where the line crosses the logx axis (because here logy=0).
Finding the coefficient values in y=Ax^k from any two points on the original curved graph y vs. x:
• Take two “typical” points in the data cloud, far away from each other: (x1,y1) and (x2,y2).
• For y=Ax^k we have k=log(y1/y2)/log(x1/x2). Then A=y1/(x1^k).
• Special case: If A=1 is logically imposed, the equation is reduced to y=x^k. Then k=logy1/logx1.
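A minimal Python sketch of this two-point recipe for the fixed exponent case, again with hypothetical points:

    # Minimal sketch: the two-point recipe for y = A*x^k (hypothetical points).
    import math

    x1, y1 = 2.0, 12.0
    x2, y2 = 8.0, 96.0
    k = math.log10(y1 / y2) / math.log10(x1 / x2)   # k = log(y1/y2)/log(x1/x2)
    A = y1 / x1 ** k                                # A = y1/x1^k
    print(round(k, 2), round(A, 2))                 # 1.5 and about 4.24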
Calculating the parameters of exponential equation in two quadrants
If we expect the exponential pattern y=A(B^x), because only two quadrants are allowed, taking logarithms leads to a linear relationship between logy and non-logged x: logy=logA+x(logB). Designating logA as a and logB as b takes us to the familiar linear form (logy)=a+bx. Hence we should graph logy vs. x itself. If the data cloud is linear, then y=A(B^x) applies. Then we can regress logy vs. x (unlogged).
There are often good reasons to use the alternative exponential expression y=A(e^kx) and natural logarithms (ln). By definition, lne=1. Hence the logarithms are related as lny=lnA+kx=a+kx. We again graph lny vs. x itself. If the resulting data cloud is linear, then y=A(e^kx) applies. Then we can regress lny vs. x itself. Recall that natural (lnx) and decimal (logx) logarithms relate as lnx=2.30logx and, conversely, logx=0.434lnx. Often we can use either logarithm.
How can we find the coefficients A and B in y=A(B^x), or A and k in y=A(e^kx)? In principle, we can again do it in two ways – using special points on the semilog graph or using two points on the original curved graph y vs. x. However, semilog graph papers use decimal logarithms, and we may get confused when shifting from log to ln. So it is safer to use the two-point formula:
• Take two “typical” points of the data cloud, far away from each other: (x1,y1) and (x2,y2).
• For y=A(B^x) we have logB=[log(y1/y2)]/(x1-x2). Then B=10^logB and A=y1/(B^x1).
• For y=A(e^kx) we have k=[ln(y1/y2)]/(x1-x2). Then A=y1·e^(-kx1). This is often the more useful form.
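A minimal Python sketch of the two-point recipe for the exponential case, with hypothetical points:

    # Minimal sketch: the two-point recipe for y = A*e^(kx) (hypothetical points).
    import math

    x1, y1 = 0.0, 5.0
    x2, y2 = 10.0, 40.0
    k = math.log(y1 / y2) / (x1 - x2)      # k = ln(y1/y2)/(x1-x2)
    A = y1 * math.exp(-k * x1)             # A = y1*e^(-k*x1)
    print(round(k, 3), round(A, 1))        # about 0.208 and 5.0
    print(round(A * math.exp(k * x2), 1))  # reproduces y2 = 40.0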
In general, when the exponential format applies, use y=A(e^kx) rather than y=A(B^x). Previous chapters have given a fair number of examples of using y=Ax^k. Examples of using y=A(e^kx) will be given in Chapter 20.
Instead of e^kx we sometimes write exp(kx). Why? Suppose we divide two exponentials and get exp[k(x1-x2)]. If we try to fit all of k(x1-x2) “upstairs”, into the exponent location, this can become confusing.
Constraints within quadrants: Two kinds of “drawn-out S” curves
Further constraints can enter. It may be that x and y are logically
restricted to only part of the positive quadrant. The box 0 ≤ x ≤ 1 and
0 ≤ y ≤ 1 is one marked example. With logical anchor points at (0,0) and
(1,1), the model y = Ax^k still applies – and it even simplifies to y = x^k. But if
a third anchor point imposes itself at (0.5, 0.5), a “drawn-out S” pattern
results, as shown and briefly discussed at the end of Chapter 9. Its
equation can be written as
y/(1-y) = [x/(1-x)]^k.
This means that log[y/(1-y)] = k log[x/(1-x)]. To test fit to this model,
graph log[y/(1-y)] vs. log[x/(1-x)] – it should produce a straight line. If it
does so, the model fits, and k is the slope of this line on the log-log
graph.
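A small sketch of this test, with invented (x, y) pairs inside the unit box; with real data one would first plot the transformed values and proceed only if they line up:

    import math

    # Hypothetical data inside the box 0 < x < 1, 0 < y < 1.
    data = [(0.10, 0.03), (0.30, 0.18), (0.50, 0.50), (0.70, 0.82), (0.90, 0.97)]

    # Transform: log[y/(1-y)] should be k times log[x/(1-x)].
    pairs = [(math.log10(x / (1 - x)), math.log10(y / (1 - y))) for x, y in data]

    # A rough slope estimate from the two outermost transformed points.
    (lx1, ly1), (lx2, ly2) = pairs[0], pairs[-1]
    k = (ly2 - ly1) / (lx2 - lx1)
    print(f"estimated exponent k = {k:.2f}")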
Figure 14.4. Exponential curve and simple logistic curve which starts
out the same way.
[Graph: two curves plotted against t, labeled “Exponential” and “Simple Logistic”; the levels M and M/2 are marked on the vertical axis.]
A quite different “drawn-out S” pattern is shown in Figure 14.4. This
comes about when an exponential growth pattern faces a ceiling, like all
biological beings do. This means that y is squeezed into the zone
between a conceptual floor and ceiling – two quadrants allowed, but
with a ceiling on y. When the ceiling on y is M and y reaches M/2 at
time t=0, then the simplest model is the simple logistic equation
y = M/(1 + e^(-kt)).
A larger k means steeper growth from near-zero to near-ceiling. This
equation can be expressed (see EXTRA below) as
y/(M-y) = e^(kt)
and hence
ln[y/(M-y)] = kt.
To test fit to the simple logistic model, graph ln[y/(M-y)] vs. t – it
should produce a straight line. If it does so, the model fits, and k is the
slope of this line on the semilog graph. This is the basis for statistical
packages for LOGIT regression.
The simple logistic curve starts out as an exponential curve. It also
approaches the ceiling in an exponential way. We already encountered
an exponential approach to a ceiling when dealing with volatility.
The simple logistic curve looks like a “drawn-out S”, like the three-anchor-point curve in a box. But the ends of the logistic “S” stretch in t from
minus to plus infinity, while the three-anchor-point curve ranges only
from 0 to 1 in x.
EXTRA: How do we get from y = M/(1+e^(-kt)) to y/(M-y) = e^(kt)?
M - y = M - M/(1+e^(-kt)) = [M(1+e^(-kt)) - M]/(1+e^(-kt)) = [Me^(-kt)]/(1+e^(-kt)).
Hence y/(M-y) = {M/(1+e^(-kt))}/{[Me^(-kt)]/(1+e^(-kt))} = e^(kt).
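For readers who like to verify such algebra numerically, the following sketch (ceiling M and growth constant k are assumed values) generates points exactly on the simple logistic curve and shows that ln[y/(M-y)] grows linearly with t, with slope k:

    import math

    M = 100.0          # assumed ceiling
    k_true = 0.8       # assumed growth constant

    # Points exactly on the simple logistic curve y = M/(1 + exp(-k*t)).
    ts = [-4, -2, 0, 2, 4]
    ys = [M / (1 + math.exp(-k_true * t)) for t in ts]

    # Transform: ln[y/(M - y)] should equal k*t, a straight line through the origin.
    for t, y in zip(ts, ys):
        print(f"t = {t:+d}   y = {y:7.3f}   ln[y/(M-y)] = {math.log(y / (M - y)):+.3f}")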
* Exercise 14.1
This exercise is really basic for understanding when to fit with what.
For each of the cases described below, do the following. 1) Sketch a
graph showing the constraints (forbidden areas and anchor points). 2)
Sketch the broad shape of the resulting curves. 3) Give the general form
of the corresponding equation; no need to calculate any of the
parameters. 4) Indicate how we should transform the inputs and outputs
so as to obtain a straight line on regular graph paper. (NOTE: Use the
simplest assumptions possible. Go along with problem statements as
given. If you think they are unrealistic, point it out in a post scriptum,
but address the exercise first.)
a) A sum of 1000 euros is placed in a savings account earning 3%
interest per year. How does this sum (S) change over time (t)?
[Show graph as S vs. t, not y vs. x. Show t over several decades
– this helps to bring out the trend.]
b) As television expands in a country, rural populations tend to
have fewer TV sets than urban populations, because of expense
and access. What could be the relationship between the
percentages of households with TV in rural (R) and urban (U)
areas? [Show graph as R vs. U, not y vs. x.]
c) As radio sets expand in a country, rural populations at first tend
to have fewer radio sets than urban populations, because of
expense and access. Later on, however, radio sets become more
widespread in rural than urban areas, because city people have
more of other means for information and leisure. Assume that
the tipping point comes when radio ownership reaches 50%
both in urban and rural areas. What could be the relationship
between the percentages of households with radio sets in rural
(R) and urban (U) areas?
d) A few families reach an uninhabited island and find it suitable
for agriculture. Over decades, their numbers expand, until they
begin to run out of arable land. How does their population (P)
change over time (t)?
Exercise 14.2
At the ECPR (European Consortium for Political Research) Workshops
in Rennes 2008, Staticia presented a paper that connected the subventions (S) paid by the European Union to various Euroregions to the
per capita GDP of those regions (G). She graphed S against logG, found
a pretty nice linear fit, and concluded: support decreases as wealth
increases.
However, the model inherent in this fit violates at both extremes the
maxim “Models must not predict absurdities”. Explain the nature of
absurdities that result, and offer a better model. The intervening steps
are as follows.
a) Show logG on the x-axis, G being GDP per capita in euros. This
means marking 1, 10, 100, 1000 and 10,000 on the axis, at
equal intervals.
b) Show subventions S on the y-axis, S being in units of million
euros. This means marking 0, 100, 200, 300… on the axis, at
equal intervals.
c) Enter a straight line at a reasonable negative slope. This is how
Staticia’s graph looked, with data points crowded around it.
d) Now statistics must yield to thinking. What is the subvention
predicted by this straight line at extremely large GDP/cap? Is it
possible?
e) And how large would the subvention be when a region’s
GDP/cap is zero? Could the European Union afford to pay it?
(Hint: Where on the x-axis do you place 1, 0.1, 0.01, 0.001,
…?)
f) What shape does Staticia’s straight line assume when we graph
S vs. G, rather than S vs. logG? Hint: Graph S vs. G, starting
both from 0, and enter the predictions from parts (d) and (e).
g) Which model would you recommend in view of the following:
(1) neither S nor G can be negative; (2) when G=0, then S is
some pretty high figure; and (3) when G increases, then S
decreases. Sketch the corresponding curve on the graph S vs. G.
(NOT S vs. logG!)
h) Write the simplest equation that satisfies those conditions.
i) If Staticia wishes to run a linear test for this logically supported
model, which quantities should she graph? Compare to the
quantities she did graph (S and logG).
j) But Staticia did get a satisfactory statistical fit, by social science
norms. How is this possible, if she graphed in an illogical way?
That’s an unfair question, unless I also give you her data so you
can visualize it. The answer is that, over short ranges, many
equations can fit a fairly dispersed data cloud – but most of
them lead to absurdities under extreme conditions. NOTE:
Actually, what we have expressed here should stand for per
capita subvention. For total subvention, the population of the
region should also be taken into account, not only their wealth.
Staticia could ignore this only because the regions have roughly
similar populations.
C. Interaction of Logical Models and
Statistical Approaches
15. The Basics of Linear Regression and
Correlation Coefficient R2
• The Ordinary Least Squares (OLS) procedure of linear regression minimizes the sum of the squares of distances between data points and the line.
• Regressing y on x minimizes the squares of vertical distances. Regressing x on y minimizes the squares of horizontal distances. Two different OLS lines result.
• These two OLS lines are directional, not algebraic. This means that if we first estimate y from x, using y ← a+bx, and then estimate x from y, using x ← a’+b’y, we don’t get back the original x.
• By OLS, a tall twin’s twin tends to be shorter than her twin.
• If we first estimate y from x and then z from y, we don’t get the same value as when estimating z directly from x – OLS regression is not transitive.
• Correlation coefficient R2 expresses the degree of lack of scatter of data points. Utter scatter means R2=0. Points perfectly on a line mean R2=1.
• In contrast to OLS slopes, R2 is symmetric in x and y.
Testing logical models with data can involve the use of statistics.
Logical models often enable us to transform data so that a linear
relationship is expected. Therefore, the basics of linear regression
should be clarified here, even though developing skills in statistics is
outside the scope of this book. You should acquire those skills, but
somewhere else. What is very much within our scope is when and how
to use these statistics skills – and as important, when NOT to use them
blindly. Statistics books tend to go light on the latter aspect, and even
when they point out inappropriate uses, students tend to overlook these
sections. The damage can be serious.
Regression of y on x
So let us proceed to linear regression. Suppose that we have 5 moderately scattered data points in a field with no constraints such as
forbidden areas or anchor points, as shown in Table 15.1 and Figure
15.1. The mean values of x (x̄ ) and y (ȳ) are also shown. The point (x̄ , ȳ)
is the “center of gravity” of the data set.
Table 15.1. Hypothetical data.
   x      y
 -3.0   -0.5
  0.5    2.0
  2.5    0.5
  3.5    3.0
  4.5    5.0
Mean: x̄ = 1.6    Mean: ȳ = 2.0
We want to pass the “best fitting” line through these points. What do we
mean by “best fitting”? That’s the catch. There are many ways to define
it. But they all agree on one thing: The best-fit line must pass through
the center of gravity of the data cloud.
The “Ordinary Least Squares” (OLS) method to regress y against
x proceeds as follows: We try to minimize the sum of the squares of
the vertical deviations from data points to the line.
What does this mean? Draw a haphazard line through the data cloud.
No, not quite – it must pass through the center of gravity. Draw vertical
lines joining the data points to the line (thick lines in Figure 15.1). The
lengths of these line segments show how far the data points are from the
line, in the vertical direction. Then draw in squares having these lines as
their vertical sides. Some of the squares are large while some others are
so tiny they do not show up in the graph. Measure the areas of squares
and add them. Now try to tilt the line around the center of gravity so
that the sum of the squares is reduced.
Figure 15.1. Vertical distances of 5 points to a random line drawn
through their center of gravity (mean x, mean y). [CAUTION: THE
LINE SHOWN IS PLACED A BIT TOO HIGH. x=1.6 is between the
2nd and 3rd points from the left; and y=2.0 is at the level of the 2nd point.
All Figures using these data must be redone, and they must show the center
of gravity.]
The middle point along the x-axis has a large square – more than the
other squares combined, but tilting the line closer to this point does not
help much because this point is so close to the center of gravity. (We
cannot shift the line away from the center of gravity!) But it pays to tilt
the line to a steeper slope, so that the square on the far left is reduced to
nothing. The line shown in Figure 15.2 is close to optimal. It passes
near the point on the far left, and it balances off the two large squares to
the right. Any further tilting of this line would increase the sum of the
squares. So this line is close to the best fit by the OLS procedure.
Figure 15.2. Vertical distances of 5 points to the line which roughly
minimizes the sum of squares, meaning the best fit y on x.
This is the basic idea behind “regression” to the least squares line. We
do not have to carry it out through such graphical trial and error.
Equations have been worked out which do the regression exactly. The
calculations are rather simple, but they become quite tedious when there
are many data points. So these calculations are best left to computer
programs. Just feed the coordinates (xi, yi) of all data points into the
computer program, and out pops the OLS line y ← a+bx, which best
predicts y from x.
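For instance, the following sketch shows what such a program does for the hypothetical data of Table 15.1; it reproduces the line y ← 1.06 + 0.59x used in Exercise 15.1 and confirms that it passes through the center of gravity:

    # OLS regression of y on x for the Table 15.1 data, done from the raw sums.
    xs = [-3.0, 0.5, 2.5, 3.5, 4.5]
    ys = [-0.5, 2.0, 0.5, 3.0, 5.0]
    n = len(xs)

    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)

    b = sxy / sxx              # slope of OLS y-on-x
    a = y_mean - b * x_mean    # forces the line through the center of gravity
    print(f"y <- {a:.2f} + {b:.2f}x   (center of gravity: {x_mean}, {y_mean})")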
The danger is that by leaving everything to the computer we do not
develop a “finger tip feel” for what OLS does – and what it cannot do.
If we apply OLS to improper data (what is improper will be explained
later), the computer program for OLS does not protest – it still
calculates a line, even if it makes no sense. Junk in → junk out. If you
draw mistaken conclusions from this line, don’t blame the method –
blame the one who misapplied a perfectly good method.
Reverse regression of x on y
Now do the reverse – try to predict x from y – and the picture changes.
Previously we measured the vertical distances from points to the line;
now we must measure the horizontal distances. Figure 15.3 shows the
same line as in Figure 15.2, plus horizontal distances to the data points
and their squares. Visibly, this line no longer is the best-fit line – some
squares are so large they hardly can be shown within the printed page.
Tilting this line to a steeper slope could strongly reduce the sum of these
horizontally based squares.
Figure 15.3. Previous best fit line, y on x, and horizontal distances to
the 5 points: The sum of squares clearly is not minimized.
This tilting is done in Figure 15.4. It shows both the previous best-fit
line (dashed), labeled “OLS y-on-x”, and a new one, along with its
squares, labeled “OLS x-on-y”. This line passes close to the point on the
upper right, and it balances off the large squares for the lowest two
points. The sum of squares is visibly smaller than in previous graph.
This is the best line for predicting x from y: xa’+b’y. It is far from the
best line for predicting y from x: y=a+bx. The crossing point of the
two best-fit lines is at mean x and mean y.
The important conclusion is that the OLS lines depend on the
direction in which we proceed:
Regressing y on x (minimizing the squares of vertical
distances) leads to one OLS line; regressing x on y
(minimizing the squares of horizontal distances)
leads to a different OLS line.
The reverse OLS line x-on-y is always steeper than
the OLS line y-on-x, when x is on the horizontal axis.
Figure 15.4. Horizontal distances of 5 points to the line that roughly
minimizes the sum of squares based on them – the best fit x-on-y. The
best-fit y-on-x line is the dashed line.
[Graph labels: “OLS y-on-x” (dashed) and “OLS x-on-y”.]
Directionality of the two OLS lines: A tall twin’s twin tends to be
shorter than her twin
Directionality of OLS lines is nice for some purposes and disastrous for
some others. Suppose that we take female twins and randomly assign
one of them to Group X and the other to Group Y. Then we graph their
heights, y vs. x. The logical model which applies to this graph obviously
is y=x, which also means x=y. It is non-directional. Boringly obvious?
But wait a bit!
Suppose I tell you that the twin X is 190 cm (6 ft. 3 in.) and ask you
to guess at the height of her twin. You are paid or penalized depending
on how close your guess is. If you offer 190 cm, you have more than a
50-50 chance of overestimating the twin’s height. Why? You know that
190 cm is way over the mean height. The chances that the twin of such
a tall woman matches such an unusual height are less than 50-50. So I
would guess at 188 cm (6’2”) at most – and maybe even lower. How
much lower? This is precisely what OLS y-on-x is about. A tall twin’s
twin tends to be shorter than her twin, and OLS tells you by how
much, on the average. Compared to the slope 1 of y=x, the OLS slope
is somewhat less than 1.
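This tendency is easy to check with a small simulation (a sketch with assumed values for mean height and spread, not actual twin data): each pair of heights shares a family component plus individual noise, and among pairs where twin X is very tall, her twin averages noticeably shorter:

    import random

    random.seed(1)
    MEAN, FAMILY_SD, INDIV_SD = 167.0, 6.0, 3.0   # assumed values, in cm

    pairs = []
    for _ in range(100_000):
        family = random.gauss(MEAN, FAMILY_SD)     # shared component of the pair
        x = family + random.gauss(0, INDIV_SD)     # twin assigned to Group X
        y = family + random.gauss(0, INDIV_SD)     # twin assigned to Group Y
        pairs.append((x, y))

    tall = [(x, y) for x, y in pairs if x >= 185.0]
    avg_x = sum(x for x, _ in tall) / len(tall)
    avg_y = sum(y for _, y in tall) / len(tall)
    print(f"when twin X is at least 185 cm: mean X = {avg_x:.1f}, mean of her twin = {avg_y:.1f}")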
Suppose the scatter in heights of twins is such that, for 50-50
chances of under- and overestimating, OLS tells us to guess at y=187
cm when x=190 cm. If we next were told that a person in the Y group is
187 cm, what would be our guess for her twin? Certainly not 190! This
would mean using the y-on-x OLS line, y=a+bx, going in the wrong
direction. Indeed, 187 cm is still unusually tall, so we would guess at
about 185 for her sister. This means we now are using the opposite x-on-y line. In the usual y vs. x way of graphing, this OLS slope is
somewhat more than 1.
Where would such a game end? Our successive estimates would
approach the mean height of women, but with ever-smaller
increments. Thus we never reach the mean height itself, technically
speaking. In the reverse direction, if we were told that the first twin is
140 cm (4’7”), we would guess at her twin being somewhat taller – so
we again approach the mean height.
Are you getting dizzy? I would. To make sure I really understand, I
would need a specific example, and I would need to work it out on my
own, rather than observing someone else doing it. This is what Exercise
15.1 is about.
Exercise 15.1
The best-fit line y on x in Figure 15.4 is y=1.06+0.59x. This is what we
would use to estimate y from x. The best-fit line x on y is y=0.57+0.89x.
But this line is used to estimate x from y, so we have to transpose it to
x=-0.64+1.12y. To keep track of which equation goes in which
direction, we better write them as y ← 1.06+0.59x and
x ← -0.64+1.12y.
a) Graph the presumed data (from Table 15.1) and the two lines.
They cross at the center of gravity G, where x=1.6, y=2.0. These
are the mean values X and Y of x and y, respectively.
b) If another country has x=4.0, what would be the corresponding
value of y? Call this point A.
c) Use this value of y to calculate the corresponding value of x.
Call this point B.
d) Use this value of x and calculate the corresponding y. Call this
point C.
e) Show the points A, B, and C on the graph, as well as arrows A
to B and B to C.
f) What would happen, if we continue this game? Where would
we eventually end up?
g) If we write and apply a computer program for the process
above, how much time might the process take, with today’s
computer capabilities, to reach perfectly this end point?
h) What would happen, if we start with a very low value of x?
Where would we now end up?
In sum, OLS is the best way to predict an unknown quantity from a
known one – a practical purpose, which is directional. But the scientific
“law”, on which such “engineering” is based, is still y=x=y. It is not “a
tall twin’s twin tends to be shorter than her twin, who herself tends to be
shorter than her twin”, which would mean y<x<y. In today’s social
sciences, this distinction all too often is missed, and this has hurt the
development of social sciences.
Most logical models are not directional. They follow “algebraic”
equations where y=x also means x=y. This applies to pretty much all the
equations one encounters in secondary school mathematics. Symbolically,
[x→y] = [y→x]          algebraic equations
The OLS equations, however, are directional:
[x→y] ≠ [y→x]          OLS regression
At low scatter, the difference is negligible. At high scatter, it becomes
enormous.
For logical models expressed as algebraic equations, a single line
must work in both directions. It’s a fair guess that its slope should be
intermediary between the lower slope of y-on-x and the higher slope of
x-on-y of the two OLS lines. This symmetric linear regression will be
presented in the next chapter. For the moment, just keep in mind that it
matters in which direction one carries out standard OLS. The custom of
showing OLS equations as y=a+bx is misleading – it really is
y ← a+bx.
Non-transitivity of OLS regression
Suppose logical considerations suggest that x has an impact on y, which
in turn has an impact on z. Symbolically: x→y→z. Average cabinet
duration (C) is one such example. For a given assembly size, the
number of seats in the electoral district (district magnitude M) largely
determines the effective number of parties (N), which in turn largely
determines cabinet duration: M→N→C. Sometimes we may already
have calculated N from M, and we want to use it to calculate C. This
means M→N, followed by N→C. Some other times we may wish to
estimate C directly from M: M→C. We expect that it should not matter
which way we go. The outcome should be the same, regardless of
whether it is M→N→C or M→C.
This is what transitivity means, and it applies to the algebraic
equations. Symbolically:
[x→y→z] = [x→z]          algebraic equations
The trouble with OLS regression equations is that, in contrast to
algebraic equations, they are not transitive. When we regress the
number of parties on district magnitude and then regress cabinet
duration on the number of parties we get one relationship between
district magnitude and cabinet duration. We get a different one when
regressing cabinet duration directly on district magnitude: M→N→C is
not the same as M→C. At low scatter, the difference is negligible. At
high scatter, it can become enormous. Symbolically:
[x→y→z] ≠ [x→z]          OLS regression, high scatter
Why is this so? It follows from the directionality just described. If we cannot go
back to the same value of x after passing through y (Exercise 15.1),
then we cannot reach the same value of z when going there through y
rather than directly. Most logical models are transitive, like M→N→C.
Thus OLS regression works in model testing only when scatter is low –
which often works out in physics but rarely does in social sciences.
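The following sketch illustrates non-transitivity with synthetic data (random numbers, not the actual M, N, C values): x, y and z are three noisy readings of one underlying quantity, so the algebraic law is simply x = y = z, yet the chained OLS slope falls well short of the direct one:

    import random

    random.seed(2)

    def ols_slope(us, vs):
        # Slope of the OLS line v-on-u.
        n = len(us)
        u_mean, v_mean = sum(us) / n, sum(vs) / n
        suv = sum((u - u_mean) * (v - v_mean) for u, v in zip(us, vs))
        suu = sum((u - u_mean) ** 2 for u in us)
        return suv / suu

    # Three noisy readings of one underlying quantity t, so the algebraic law is x = y = z.
    ts = [random.gauss(0, 1) for _ in range(10_000)]
    xs = [t + random.gauss(0, 1) for t in ts]
    ys = [t + random.gauss(0, 1) for t in ts]
    zs = [t + random.gauss(0, 1) for t in ts]

    chained = ols_slope(xs, ys) * ols_slope(ys, zs)   # x -> y, then y -> z
    direct = ols_slope(xs, zs)                        # x -> z in one step
    print(f"chained slope: {chained:.2f}   direct slope: {direct:.2f}")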
Correlation coefficient R2
Along with the coefficients a and b in OLS linear regression equation
ya+bx, one usually reports R2. This number expresses the degree of
lack of scatter of data points. If the points line up perfectly, R2 is 1. If
the points form a blob without any direction, then R2 is 0.
Figure 15.5. Linear regression presumes a roughly elliptic data cloud,
without curvature.
One can use just R. In addition to scatter, R also indicates whether the
slope is up or down. Figure 15.5 offers some schematic examples.
Imagine that the ellipses shown are rather uniformly filled with data
points, and there are none outside the ellipse. The slimmer the ellipse,
the higher R2 is. Why is R2 used more frequently than R, when the latter
gives more information? It can be said that R2 expresses the share of
variation in y that is accounted for by the variation in x. The rest of the
variation in y is random noise, insofar as dependence on x is concerned.
For the data in Figures 15.1 to 15.4, R2=0.66.
So as to give more substance to these schematic examples, Figures
15.6 and 15.7 reproduce previous Figures 11.2 and 12.1, respectively.
Figure 15.6 (R2=0.51, hence R=(0.51)^(1/2)=+0.71) is akin to the first ellipse
in Figure 15.5, tilting up (R positive) and moderately slim. Figure 15.7
(R2=0.79, hence R=-(0.79)^(1/2)=-0.89) is akin to the last ellipse in Figure 15.5, tilting down
(R negative) and quite slim.
Figure 15.6. Individual-level volatility of votes vs. effective number of
electoral parties: data and best linear fit from Heath (2005), plus coarse
and refined predictive models. For best-fit line, R2=0.51.
[Graph: individual-level volatility (%) vs. effective number of electoral parties (0 to 10); the conceptually forbidden areas and the anchor point are marked, and the model curves are shown together with the best-fit line.]
Figure 15.7. Mean cabinet duration vs. effective number of legislative
parties – Predictive model and regression line. Source: Taagepera and
Sikk (2007).
Notes:
Thin solid line: best fit between logarithms.
Bold solid line: theoretically based prediction [C = 42 years/N^2].
Dashed lines: one-half and double the expected value.
Exercise 15.2
In Figures 15.4, 15.6, and 15.7, take a pencil and lightly sketch ellipses
that encompass the data points. Comparing the slimness of these ellipses
to those in Figure 15.5, estimate the corresponding R-squares. Compare
your results to the actual R-squares (0.66, 0.51, and 0.79, respectively).
Both notation and names in the literature can be confusing. Kvålseth (1985) presents no
less than 8 different expressions for R2 that appear throughout the literature. They most
often yield approximately the same result, but for some odd data constellations they can
differ. Some are called coefficients of dispersion or of determination. Some sources
distinguish between R2 and r2. At the risk of omitting significant differences, this book
uses just “correlation coefficient R2”. It is always positive, when applied to the scatter of
the data cloud as such.
Figure 15.8. When scatter is extensive, the two OLS lines diverge from
the main axis of an elliptic data cloud tilted at 45 degrees.
[Graph labels: R2 – measure of slimness of data cloud; Symmetric line – main axis of data cloud; OLS y-on-x – scatter-reduced slope; OLS x-on-y – scatter-enhanced slope; center of gravity C.]
The R2 measures lack of scatter: But scatter along which line?
In previous Figure 15.5, the main axis of the ellipse (dashed line) is
visibly the best-fit line for all three data clouds. Is this line the OLS
regression line? But if so, which one would it be – y-on-x or x-on-y?
Actually, the main axis is intermediary between the two OLS lines; we
will see that it’s the symmetric regression line (as long as the main axis
is roughly at 45 degrees). All three lines pass through the center of
gravity C of the data cloud.
For the slim data cloud on the right in Figure 15.5, the three
regression lines are practically the same. But for the almost round data
cloud in the center they diverge, as shown in Figure 15.8. Here the y-on-x line has a much shallower slope than the central line, while the line
x-on-y has a much steeper slope. This is a general feature of OLS
regression: As scatter increases, the slope of y on x is reduced, while the
slope of x on y is enhanced. As these “scissors” between the two OLS
lines widen, R-square decreases.
Although R-square often is reported along with a single OLS line (y
on x), it has exactly the same value for the other OLS line (x on y). It goes
with the combination of both lines.
16. Symmetric Regression and its Relationship
to R2
• Symmetric regression line minimizes the sum of rectangles formed by vertical and horizontal distances from data points to line.
• The slope B of the symmetric regression line is a pure measure of slope, independent of scatter. Similarly, R2 is a pure measure of lack of scatter, independent of slope.
• Together, B and R2 tell us how tilted and how slim the data cloud is – B expresses the slope of the main axis, and R2 the slimness of the ellipse.
• The slopes of the two OLS lines result from a combination of these pure measures. They are mixtures of slope B and scatter R2.
• The symmetric regression equation is multi-directional and transitive. It is an algebraic equation, suitable for interlocking relationships.
• Being one-directional, OLS equations cannot represent interlocking relationships.
We have seen that there are two OLS lines, and their equations are
directional and non-transitive. They are so because they treat x and y
asymmetrically, minimizing squares of deviations either in the vertical
or in the horizontal direction. Testing of logical models might be on
safer grounds with a regression method that treats x and y in a symmetric
way.
From minimizing the sum of squares to minimizing the sum of
rectangles
How could we regress so that x and y enter in a symmetric way? Ask
first the reverse question: What caused the asymmetry in the OLS
procedure? This came about because we measured either the vertical
distances between data points and the line, or the horizontal. Well, take
them now both into account, on an equal basis. Start with the OLS line
y-on-x in previous Figure 15.2, but show both the vertical and horizontal
distances of points to line (Figure 16.1).
These lines form the two sides of rectangles. The two remaining
sides are shown as dashed lines. Now look at the areas of these
rectangles. Could we reduce their sum? Visibly, a steeper line could
reduce the areas of the two largest rectangles. Compared to the two OLS
lines in Figure 15.4, an intermediary line minimizes the sum of
rectangles, as seen in Figure 16.2. To minimize clutter, only two sides
of the rectangles are shown. (In fact, the entire argument could be made
on the basis of the areas of the triangles delineated by the vertical and
horizontal distances and the line.)
Symmetric regression line minimizes the sum of
rectangles formed by vertical and horizontal distances
from data points to line.
Figure 16.1. Vertical and horizontal distances of 5 points to the best-fit
line y on x.
Figure 16.2. Vertical and horizontal distances of 5 points to the line that
roughly minimizes the sum of rectangles (or triangles) based on them –
the best fit symmetric in x and y.
[Graph labels: OLS y-on-x, Symmetric regression line, OLS x-on-y.]
Symmetric regression line equations are multi-directional and transitive.
In this sense, they are algebraic equations. In terms of the example in
the preceding chapter, symmetric regression M→N, followed by
regression N→C, yields the same result as direct regression M→C.
Symbolically:
[x→y→z] = [x→z]          symmetric regression equation
This is crucial for establishing interlocking relationships (Chapter 10).
Interlocking relationships need two-directional equations. Hence
they cannot be based on OLS regressions. They can be based on
symmetric regression.
How R-squared connects with the slopes of regression lines
Slope B of the symmetric regression line is the geometric mean of the
slopes (b and b”) of the two standard OLS lines in relation to the same
axis:
B = ±(bb”)^(1/2),
the sign being the same as for b and b”. It can also be shown that R-squared is the ratio of the two OLS slopes:
R2 = b/b”.
In sum, the following picture emerges. The slope B of the symmetric
regression line is a pure measure of slope, independent of scatter.
Similarly, R2 is a pure measure of lack of scatter, independent of slope.
Together, B and R2 tell us how tilted and how slim the data cloud is – B
expresses the slope of the main axis, and R2 the slimness of the ellipse.
The slopes of the two OLS lines result from a combination of these pure
measures. They are mixtures of slope B and scatter R2 – slopes reduced
or enhanced by scatter.
For a full description of the data cloud, we also need the coordinates
of its center of gravity (mean x and y), where the three regression lines
cross. Proof of the relationships above is left to the EXTRA section. It
involves more mathematics than the rest of this book.
Exercise 16.1
On top of your graph in Exercise 15.1 also graph the symmetric
regression line and determine its equation. Proceed as follows.
a) The two OLS lines are y ← 1.06+0.59x and y ← 0.57+0.89x.
Calculate the slope (B) of the symmetric line as the geometric
mean of the slopes of the two OLS lines.
b) Pass the line with that slope through the center of gravity G(1.6,
2.0).
c) The symmetric line has the form y=a+Bx. To find a, plug x=1.6,
y=2.0 into this equation and solve for a.
Figure 16.3. OLS and symmetric regression lines for three points.
[Graph: three data points A, B, D and the center of gravity C, plotted as y vs. x with x running from 0 to 2, and the regression lines drawn in.]
* Exercise 16.2
Figure 16.3 shows a most simple case: just 3 points (A, B, and D).
a) Copy the data points on square paper so as to be able to place
the regression lines accurately. However, use the same scale on
both axes (in contrast to Figure 16.3) – one unit must have the
same length on both axes.
b) Specify two points on the OLS regression line y on x. How?
Ask which value of y minimizes the squares of vertical
distances to the line when x=0. Do the same at x=2.
c) Draw this OLS line on the graph. Calculate its equation.
d) Now specify two points on the OLS regression line x on y.
Draw it, and calculate its equation.
e) Calculate the coordinates of the center of gravity C. How? At
the point where two lines cross, the same value of y satisfies the
equations of both lines.
f) Calculate the slope of the symmetric regression line. Compare it
to the slope of the line joining the points B and D. (CAUTION:
Here we have a point labeled B, but the same symbol is used in
text for the slope of the symmetric line. Keep the two “B”s
separate in your mind.)
g) Calculate the intercept a for the symmetric regression line.
How? This line, too, passes through the center of gravity. So
plug its slope and the coordinates of C into y=a+Bx.
h) Draw in the symmetric regression line.
i) Draw conclusions.
EXTRA: The mathematics of R2 and the slopes of regression lines
This section, more mathematical than the rest of this book, is not
obligatory reading. We can get the formulas for the OLS lines in any
basic statistics book, but few of these books even mention symmetric
regression. The present section enables us to make the connection, if
one is interested.
All three regression lines pass through the center of gravity G of the
data cloud. The coordinates of G are mean x (X = Σxi/n) and mean y
(Y = Σyi/n) of all the data points. Indices i refer to the coordinates of
individual data points. The slope B of the symmetric regression line can
be calculated directly from data (Taagepera 2008: 173–174):
B = ±[Σ(yi - Y)^2 / Σ(xi - X)^2]^(1/2).
With indices dropped and the axes shifted so that the means become
X=Y=0,
B = ±[y2/x2]1/2.
The sign of B (+ or -) is the same as for correlation coefficient R. This
formula may not look symmetric in x and y, given that y is on top and x
at the bottom. But keep in mind that any slope stands for dy/dx. When
we introduce B=dy/dx, the result is symmetric in x and y:
(dy)^2/Σy^2 = (dx)^2/Σx^2. This slope B is a pure measure of slope in the sense
that, if we increase random scatter, this slope does not systematically
shift up or down.
The formula for R2 is also symmetric in x and y:
R2=[(yi-Y)(xi-X)]2/[(xi-X)2(yi -Y)2].
With indices dropped and X=Y=0, it becomes
R2=[xy]2/[x2y2].
It is a pure measure of lack of scatter in the sense that it does not depend
on slope.
In previous pictures of roughly elliptic data clouds (Figures 15.5 and
15.8), B and R2 tell us how tilted and how slim or roundish the data
cloud is – B reflects the slope of the main axis (as long as the main axis
is roughly at 45 degrees), and R2 the relative width of the ellipse. This
means that the slope of the symmetric regression line is the scatter-
independent complement to R2. (For fuller description, we also need the
coordinates of the center and the range of values of x or y.)
The slopes of the two OLS lines result from a combination of these
pure measures. They are mixtures of slope B and scatter R2. We usually
measure the slope of OLS y on x relative to the x axis (b=dy/dx). But
the slope of OLS x on y might be measured either relative to y axis
(b’=dx/dy) or relative to x axis (b”=dy/dx), so that b’b”=1. So we have
three relationships, as pictured in previous Figure 15.8:
b=|R|B
OLS y on x – scatter-reduced slope.
b’=|R|/B OLS x on y, slope relative to y axis, scatter-reduced.
b”=B/|R| OLS x on y, slope relative to x axis, scatter-enhanced.
It follows that b/b”=R2, when the slopes of both OLS lines are measured
with respect to the x axis. If further random fluctuation is imposed on a
given data set, R2 is reduced. This means that the ratio of the slopes of
the OLS lines is reduced. How does this reduction come about? The
slope of the OLS line y-on-x (b) is reduced, while the slope of the OLS
line x-on-y (b”) is enhanced, so that both contribute to reduction in R2.
But this means that the slope of each OLS line is affected by the degree
of scatter. Hence the OLS slope is a mixed indicator of steepness of
linear trend and of scatter around this trend.
Note that R2 can be visualized as degree of lack of scatter, either
along the symmetric line or along the combination of the two OLS lines. It
would be misleading to visualize R2 as degree of lack of scatter along a
single OLS line. This can be seen in Figure 15.8: The scatter expressed
by R2 is distributed lopsidedly around the OLS line, while it is
distributed evenly around the symmetric regression line. When R2 is
low, reporting a single OLS slope along with R2 effectively means
counting the real slope at half-weight (B in |R|B) and the degree of
scatter at one-and-a-half weights (R2 plus |R| in |R|B).
It also follows from the equations above that the symmetric slope B
is the geometric mean of the slopes of the two standard OLS lines in
relation to the same axis (b and b”):
B = ±(bb”)^(1/2),
the sign being the same as for b and b”. When the slope of x-on-y is
measured with respect to the y axis (b’=1/b”), the relationship is
B = ±(b/b’)^(1/2).
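The relationships above can be checked numerically. The sketch below uses a handful of invented points (not data from this book) to compute b, b” and B directly from the sums and to verify that B = ±(bb”)^(1/2) and R2 = b/b”:

    import math

    # Invented data points (x, y).
    pts = [(1.0, 1.5), (2.0, 1.8), (3.0, 3.6), (4.0, 3.9), (5.0, 5.7)]
    n = len(pts)
    X = sum(x for x, _ in pts) / n
    Y = sum(y for _, y in pts) / n

    sxx = sum((x - X) ** 2 for x, _ in pts)
    syy = sum((y - Y) ** 2 for _, y in pts)
    sxy = sum((x - X) * (y - Y) for x, y in pts)

    b = sxy / sxx                                  # OLS y-on-x, slope relative to x axis
    b2 = syy / sxy                                 # OLS x-on-y, slope relative to x axis (b")
    B = math.copysign(math.sqrt(syy / sxx), sxy)   # symmetric slope, sign of R
    R2 = sxy ** 2 / (sxx * syy)

    print(f'b = {b:.3f}, b" = {b2:.3f}, B = {B:.3f}')
    print(f'check: (b*b")**0.5 = {math.sqrt(b * b2):.3f},  b/b" = {b / b2:.3f},  R2 = {R2:.3f}')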
17. When is Linear Fit Justified?
• A linear fit of data is acceptable only if the data cloud is uniformly dispersed within a roughly elliptic area, with no visible curvature or structure. Even then the linear fit must not violate logical constraints.
• Always graph all data y vs. x before pushing the regression button. Then carry out linear regression only when it makes sense.
• The use of R2 is subject to the same limitations.
• The least squares and symmetric regression methods are both quite sensitive to outliers, and so is R2.
When we are given just the equation of the appropriate regression line
plus R2, but no graph, we tend to imagine what was shown in Figure
15.8: a roughly ellipse-shaped data cloud, tilted according to the slope b
in y=a+bx, with ellipse slimness corresponding to the value of R2. This
is an adequate mental picture under the following conditions:
• the center of gravity C (mean x, mean y) does express the center of the data cloud in a meaningful way;
• the regression line does express the main axis of the data cloud in a meaningful way; and
• R2 does express the dispersion of data around this line in a meaningful way.
Recall that the main axis of the data cloud is the line of symmetric fit
(when its slope is close to 1). If instead, y=a+bx stands for OLS
regression y-on-x and R2 is low, it really means y ← a+bx. We would
then tend to underestimate the actual slope of the ellipse main axis. But
this may be a relatively minor distortion. The real trouble is that all too
often data clouds do not look at all like ellipses.
Many data clouds do not resemble ellipses
Data clouds may look like bent sausages or even like croissants, as in
Figure 17.1. (The data points and regression line were copied from a
published graph; center of gravity C and dashed parabola have been
added.) Here linear regression is misleading because
• the center of gravity lies in a zone where few data points occur;
• the y-on-x regression line passes through a zone with few data points and does not express the configuration, conjuring in our minds the false image of a tilted ellipse; and
• the value of R2, low as it is bound to be in Figure 17.1, wrongly conjures the image of an almost circular data cloud (as in Figure 15.5, center) rather than the actual complex configuration.
Symmetric regression would be no better. No simple curve fits here, but
a roughly parabolic fit (dashed curve) would be more expressive of the
pattern than any straight line. How do we know whether our data cloud
is roughly elliptic or not? The blunt advice is:
Always graph all data y vs. x before pushing the regression
button.
Then carry out linear regression only when it makes sense.
Whenever the data cloud has even a slight curvature (bent sausage
rather than a straight one), consider some data transformation so as to
straighten the pattern, before regressing. How do we carry out such a
transformation? Recall Chapter 14. Applying linear regression to data
configurations not suited for it can lead to monumental mischaracterization of the data. This is so important that further cautionary
examples are due.
Figure 17.1. A case where any linear regression would be misleading.
The OLS line y-on-x is shown. The complementary OLS line x-on-y,
also passing through the center of gravity C, would be almost vertical.
Grossly different patterns can lead to the same regression lines and
R2
Consider the examples in Figure 17.2. Assume we have no information
on any conceptual limitations. Those four data configurations have been
designed so that they all lead to exactly the same linear regression lines:
The center of gravity is the same: x=9.00, y=7.50. OLS y-on-x yields
y=3.00+0.50x, OLS x-on-y yields y=0.75+0.75x, and symmetric
regression line is y=1.9+0.61x. The four configurations also have the
same correlation coefficient, a pretty high one (R2= 0.67), if we deemed
it appropriate to apply a linear fit. But in which of these cases does a
linear fit make sense?
Figure 17.2. Data that correspond to exactly the same linear regression
lines and R-square (Anscombe 1973 for y1 to y3, Taagepera 2008: 201
for y4). But does a linear fit make sense?
[Four panels, y1 through y4, each plotting y (0 to 10) against x (0 to 20).]
• Constellation y1: A linear fit looks acceptable because the data cloud is uniformly dispersed, with hardly any visible curvature. One could draw an ellipse around the data points, and the crowdedness of points would be roughly the same throughout the ellipse. (True, one might detect an empty region in the lower center, meaning that a slightly bent curve would fit better. But in the absence of conceptual constraints we might gloss over it.)
• Constellation y2: The points fit neatly a parabolic-looking curve, and a corresponding transformation should be applied before linear fitting. Linear fit to raw data would be ludicrous. Random deviation from a regular pattern is much less than intimated by R2=0.67. The parabolic transformation could be based on statistical considerations, but this would also be prime time for asking: why is it that y first rises and then falls with increasing x?
• Constellation y3: It has 10 points perfectly aligned, while one point is a blatant outlier. This point clearly does not belong and should be omitted before carrying out regression. (The statistical justification for deletion is that it deviates by more than 3 standard deviations.) When this outlier is omitted, the slope is slightly lower than previously calculated, and R2 approaches 1.00. One should try to figure out how the outlier came to be included in the first place. Maybe there was a typo in the data table – this happens.
• Constellation y4: The pattern is far from a rising straight line. We observe two distinct populations where y actually decreases with increasing x, plus an isolate. This pattern should make us wonder about the underlying structure: Why is there such an odd pattern? Reporting only the rising regression line would misrepresent the data.
Note that none of the peculiarities of the three latter cases would be
noticed, if one just used tabulated data and passively went on to linear
regression. One must graph the data! The use of R2 is subject to the
same limitations. If only the linear regression results are reported, we
would imagine a data cloud like y1, and never imagine that it could be
like the three others.
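For readers who want to verify the numbers, here is a sketch using the three Anscombe (1973) configurations, with values as commonly tabulated (the fourth configuration, from Taagepera 2008, is not reproduced here). All three return essentially the same line and R2, which is exactly why graphing is indispensable:

    x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
    y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
    y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
    y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

    def ols_stats(xs, ys):
        n = len(xs)
        xm, ym = sum(xs) / n, sum(ys) / n
        sxy = sum((a - xm) * (b - ym) for a, b in zip(xs, ys))
        sxx = sum((a - xm) ** 2 for a in xs)
        syy = sum((b - ym) ** 2 for b in ys)
        slope = sxy / sxx
        return slope, ym - slope * xm, sxy ** 2 / (sxx * syy)   # slope, intercept, R2

    for name, ys in [("y1", y1), ("y2", y2), ("y3", y3)]:
        b, a, r2 = ols_stats(x, ys)
        print(f"{name}: y <- {a:.2f} + {b:.2f}x,  R2 = {r2:.2f}")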
Exercise 17.1
Make an exact copy of Figure 17.2. If you can’t use a copy machine,
paste this graph on a well-lighted window, paste blank paper on top of
it, and trace carefully all the points.
Add the symmetric regression line y=1.9+0.61x to all 4 graphs.
[Note that the center of gravity (9.00, 7.50) is on this regression line.]
The discrepancies characterized above will become more apparent.
(CAUTION: The scales on the four graphs are somewhat different, so
one must scale off the distances on each of them separately. Sorry for
this inconvenience.)
* Exercise 17.2
Make an exact copy of Figure 17.2. If you can’t use a copy machine,
paste this graph on a well-lighted window, paste blank paper on top of
it, and trace carefully all the points.
All four configurations in Figure 17.2 have the same arithmetic
means for x and y: x=9.00, y=7.50. Place this point on all four graphs.
(CAUTION: The scales on the four graphs are somewhat different, so
one must scale off the distances on each of them separately. Sorry for
this inconvenience.)
a) In which cases, if any, would the use of these particular values
of arithmetic means be justified, because they adequately
characterize something about the data cloud?
b) In which cases, if any, would calculation of arithmetic means be
justified once something has been done with the data?
c) In which cases, if any, should one use means different from
arithmetic?
d) In which cases, if any, should one give up on trying to calculate
any kind of means, because they would leave a mistaken
impression of the actual configuration?
Do not assume that one and only one case fits each question!
Sensitivity to outliers
Linear regression is trickier business than some social scientists realize.
In particular, the least squares regression method is quite sensitive to
extreme values (outliers), and so is R2 (Kvålseth 1985). Symmetric
linear regression shares the same problem. This is illustrated by the
third example in Figure 17.2. While the single outlier affects the center
of gravity only slightly, it makes the slope of the regression lines much
steeper and lowers R2 from nearly 1.00 to 0.67. Any point far out of a
generally ellipse-shaped data cloud can mess up the results.
Outliers can legitimately be excluded from a set by some statistical
considerations, such as being off by three (or even just two) standard
deviations, in an otherwise normal distribution. There are also more
refined regression methods that minimize their impact, such as Tukey’s
outlier-resistant method (Kvålseth 1985). The main thing is to know
when trouble looms, so that one can consult. How does one smell
trouble? By graphing and eyeballing.
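A minimal sketch of such screening (the data are invented, with one planted outlier): fit the line, compute the standard deviation of the vertical residuals, and flag any point that deviates by more than two standard deviations. Note that a single blatant outlier inflates that standard deviation, which is one reason this screening is only a first check, not a substitute for graphing:

    xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 12.5, 8.1, 8.9, 10.2]   # 7th point planted as an outlier

    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sxx = sum((x - xm) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ym - slope * xm

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

    for x, y, r in zip(xs, ys, residuals):
        if abs(r) > 2 * sd:   # "two standard deviations"; the outlier itself inflates sd
            print(f"possible outlier at ({x}, {y}): residual {r:+.2f}, residual sd {sd:.2f}")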
Empirical configuration and logical constraints
Linear regression, be it standard OLS or symmetric, may be used only
when linear fit is justified in the first place. When is it, and when isn’t
it? One must check for acceptable configuration of the empirical data,
as has been done above. In addition, conceptual agreement with constraints is also needed. Conceptual agreement includes passing through
anchor points and avoidance of forbidden areas.
Thus for volatility in Figure 15.6 the best linear fit V-on-N narrowly
misses the anchor point at N=1. (The reverse fit of N-on-V would err in
the opposite direction. The symmetric regression line would be in between, and hence very close to the anchor point.) The linear model
does respect the anchor point but goes through the ceiling at V=100 – so
we have to treat the linear model with great caution, as a merely local
model. Linear fit is acceptable only when volatility is much below 50
(which most often is the case). Compatibility of linear fits with logical
constraints will surface again and again later on.
The published graph that is copied in Figure 17.1 actually includes
more information: each data point carries the name of a country in
Europe. Upon inspection, all the top points pertain to Northern and
Central Europe, and bottom points to Southern Europe. Hence there is
no parabola. There is one downward trend line in Northern Europe, and
an upward one in Southern Europe! The authors of the published article
missed it. They did the right thing by graphing and labeling the data
points. But then they forgot to ask: “What can we see in this graph?”
(Cf. Chapter 12.) They went blindly statistical and missed the main
substantive feature.
This happens often, when data graphs are published. All too often
only statistical analysis is published, without graphs. Then one can only
wonder what could be hidden behind it. It could be garbage like fitting
the data in Figure 17.1 with a single OLS line, or nuggets like discovering different trends in Northern and Southern Europe.
Exercise 17.3
In Figure 17.1, assume that all points above the OLS line pertain to
Northern and Central Europe, and those below the line to Southern
Europe. Sketch in the separate best-fit lines for the two groups.
Exercise 17.4
In the top graph (y1) of Figure 17.2, join the top 6 points with a curve,
then do the same for the bottom 5 points. Does the pattern still look
straight? Compare to the next pattern (y2).
18. Federalism in a Box
• Even the best graphs published in social sciences can be improved on.
• Placing a data graph in a frame that does not correspond to the logically allowed area can distort our impression of the data.
• Showing a regression line that extends into logically forbidden area can distort our impression of the data.
• Recall that there is no single OLS regression line. When data are scattered, the slopes of the two OLS lines (y on x, and x on y) diverge wildly. A symmetric regression line with an intermediary slope is available.
• When the conceptually possible range of x goes from a to a larger number b, it can be converted to the range 0 to 1, by X = (x-a)/(b-a).
This chapter presents further examples of thinking inside a box,
comparing linear regression to fixed exponent curves. No full-fledged
quantitatively predictive logical model could be developed in these
cases – it is not that easy! The point is to show that more could be done
in the direction of model building than at first met the eye. Arend
Lijphart’s Patterns of Democracy (1999) is a landmark in the study of
democratic institutions, including various aspects of federalism.
Constitutional rigidity and judicial review
One might expect that federalism needs an appreciable degree of
“constitutional rigidity”. This means a written constitution that is hard
to change, because this is the only way to specify and protect the rights
of the federal subunits. Rather than measured quantities, Lijphart uses
here informed estimates. He does so on a 1 to 4 point scale, where 1
stands for no written constitution and 4 for one that is extremely hard to
modify.
A related concern is “judicial review of laws”. This means that
courts can pass judgment on whether a law conforms to the constitution.
Lijphart again uses a 1 to 4 scale, where 1 stands for no judicial review
and 4 for a very powerful one. One cannot declare a law unconstitutional when there is no constitution! Thus a rating “1” on constitutional rigidity should exclude anything above “1” on judicial review.
This looks like a logical anchor point. But how are these two features
related otherwise, if at all?
Figure 18.1 shows first Lijphart’s (1999: 229) graph of judicial
review vs. constitutional rigidity. Then it shows the same graph with
some additions. Cover up the bottom part, for the moment, and focus on
the top. What do we see? There is an almost circular data cloud in the
center of the field, while the extremes are uninhabited. And there is a
regression line, which is only slightly tilted. This suggests that the
extent of judicial review increases with constitutional rigidity, but only
mildly. The text in Lijphart (1999: 229) states that the correlation
coefficient is 0.39. This corresponds to a very low R2=0.15, which
reflects the extreme scatter of data points.
Figure 18.1. Judicial review vs. constitutional rigidity: Original graph
(Lijphart 1999: 229), and with addition of conceptually allowed region
and several regression lines.
Note: SOME SCANNED-IN FIGURES MAY NOT COPY IN
NORTH AMERICA.
However, the picture changes when the borders of the conceptually
allowed region are drawn in (from 1 to 4 on both axes), as is done in the
bottom graph. What looked like areas devoid of data points are actually
areas where no data points can possibly occur! As for the conceptually
allowed region, almost all of it is inhabited fairly uniformly. Only the
top left corner is empty: Stringent judicial review based on a non-written
constitution does not occur – as one would logically expect. But
otherwise, anything seems to go. Stringent constitutional rigidity can go
with absolutely no judicial review (Switzerland, SWI). Most surprising,
countries with no unified constitutional document can still have
appreciable judicial review (Colombia, COL). These features do not
stand out in the original graph.
Why is the impression from the two graphs so different? It’s due to
the meaningless frame in the original graph. It misleadingly intimates
that data points could run from 0 to 5, rather than the actual 1 to 4.
Moreover, the regression line is extended into the forbidden region.
This elongated line increases the impression of fair correlation when
there is almost none.
How did this happen? Lijphart tells me that he drew his graphs by
hand, using a 1 to 4 square. The person who redrew the graphs on
computer thought it looked nicer with an expanded frame. It may look
nicer – but it leaves a wrong impression. In science, truth must come
before beauty. Placing a data graph in a frame that exceeds the
logically allowed area can distort our impression – and so can
showing a regression line which extends into logically forbidden
area.
There are further problems with the regression line. Any data y vs. x
lead to two OLS regression lines. The one used in the top figure
regresses Review on Rigidity. The points furthest from the overall trend
affect the OLS line in a major way. On the right side of Figure 18.1,
SWI pulls the line down, US pulls the line up, and the intermediary
Australia (AUL) hardly matters.
Now turn the graph by 90 degrees so that the judicial review axis
becomes horizontal and constitutional rigidity increases upwards. Try
the same balancing act. Compared to the previous OLS line, the data
points IND, GER and US at the far left all pull downwards. The
balanced position is close to GER. At the far right, all the points,
ranging from SWI to ISR, NZ and UK, pull upwards, compared to the
previous OLS line. The balanced position might be in between NET and
UK. The overall balance among all the points is approximately as
shown in the lower graph of Figure 18.1. This is the reverse OLS
regression line, Rigidity on Review.
The two OLS lines diverge wildly, because scatter is huge, as
reflected in R2=0.15. The bottom graph also adds the symmetric
regression line, which treats the two variables on an even basis (Chapter
16). This is the real “trend line”, if any straight line applied at all.
Is any linear fit acceptable on conceptual grounds? First, does the
figure have any conceptual anchor points through which an acceptable
fit must pass? An anchor point at (1,1) can be asserted: One cannot
declare a law unconstitutional when there is no constitution. At the
opposite extreme, it’s less clear. The lack of a single written document
still doesn’t mean total absence of basic norms. This is why Iceland
(ICE) can have a mild degree of judicial review. On the other hand,
even an utmost degree of constitutional rigidity doesn’t call for judicial
review (witness Switzerland, SWI) – it just makes such a review
possible and highly likely. In sum, (1,1) at bottom left might be our best
guess in the absence of any other information. The symmetric regression line passes close to it, while the OLS lines deviate widely, in
opposite directions.
A similar vague claim of being an anchor point might be made for
top right corner (4,4). But most countries have judicial reviews at level
2, regardless of constitutional rigidity. Hence one might sketch in a
“toppled S” shaped curve, joining (1,1) and (4,4), but almost horizontal
as it passes through the center of gravity (the crossing point of linear
fits). Its equation would be more complex, akin to those we’ll discuss in
connection with Figure 21.7.
What have we achieved, in terms of logical model building? We
merely have cleared away some possible misconceptions, by delineating
the allowed area. This by itself has value. As for positive model
building, face the real world. Many relationships observed, in nature
and society, remain fuzzy, unless we get some new ideas.
We may ask whether a third factor might affect the picture, one that
differs for countries toward the top left (COL, ITA, IND) and for
countries toward the bottom right (SWI, JPN, FIN, NET, LUX).
Inspection of the graph reveals no social or political commonalities
among the members of these two groups. Maybe such a factor will be
discovered, but it may also be that the scatter remains random noise,
based on past history.
Degree of federalism and central bank independence
Figure 18.2 shows another case in Lijphart (1999: 241). Here the
degree of federalism and decentralization is estimated on a scale from 1
to 5. Central bank independence is a compound of many estimates,
which makes it a quasi-continuous measure. In principle, it could range
from 0 to 1.0.
Figure 18.2. Central bank independence vs. degree of federalism:
Original graph (Lijphart 1999: 241), and addition of conceptually
allowed region and the reverse regression line. CORRECTION: The
labels “y on x” and “x on y” must be reversed.
In the original graph the data cloud seems to occupy most of the space
available, except top left and bottom right. Introducing the conceptually
allowed region alters the picture. It shrinks the field left and right, while
expanding it – little at the bottom but markedly at the top. Now we see
that both extremes on the x-scale occur widely – utter centralization as
well as complete federalism. In contrast, the extremes of complete or no
central bank independence do not occur at all. Compared to the
conceptually allowed range, from 0 to 1.0, the actual range of central
bank independence is rather limited, from 0.2 to 0.7.
Correlation is somewhat stronger than in the previous case, with
R2=0.32, hence the two standard OLS lines are closer to each other.
(The symmetric regression line, in between, is not shown.) What about
conceptual anchor points? Imagine a central bank with zero independence. What degree of federalism would we expect? Certainly no more
than the minimal “1”. Now imagine a central bank with full independence. We might expect full federalism (“5”). So the opposite corners of
the allowed area just might be conceptual anchor points. To join them,
while respecting the data, would take again a “toppled S” shaped curve,
which roughly follows the y-on-x (mistakenly shown as “x-on-y”) line at
median degrees of federalism.
Once again, we have made only limited headway toward a quantitatively predictive logical model; we have at least established the conceptually
allowed region and anchor points.
Bicameralism and degree of federalism
Figure 18.3 (from Lijphart 1999: 214) might allow us to go slightly
further. Bicameralism is estimated on a scale from 1 to 4. Delimiting the
allowed region shows two empty corners. Accordingly, correlation is
higher (R2=0.41) than in previous graphs, and the two OLS lines are
relatively close to each other. We might claim that full-fledged
federalism does call for full bicameralism – two equally powerful
chambers – so that both population and federal subunits can be
represented on an equal basis. Indeed, for x=5, 4 out of 5 data points
have y=4. So (5,4) could be a logical anchor point.
It is trickier for x=1. Even some purely unitary countries have two
chambers on historical grounds. (The second chamber used to represent
aristocracy.) We may still tentatively accept an anchor at x=1, y=1. This
location is heavily populated with empirical data points, even though
many countries with x=1 have y>>1. If we accept an anchor at (1,1), then
a fit with Y=Xk could be tried.
Figure 18.3. Bicameralism vs. degree of federalism: Original graph
(Lijphart 1999: 214), and addition of conceptually allowed region,
reverse regression line, and a fit with Y=Xk.
To do so, we must first convert the ranges 1 to 5 for x and 1 to 4 for y to
ranges from 0 to 1. How do we do that? First, we must pull down the
lower limit, from 1 to 0, by subtracting 1. The total span of possible
values on x is 5-1=4. On y, it is 4-1=3. So we have to divide by these
spans. The result is X=(x-1)/4 and Y=(y-1)/3. Check that now the lower
anchor point corresponds to X=0, Y=0, and the upper anchor point
corresponds to X=1, Y=1. Any curve Y=Xk would join these points, but
most of them leave most data points on one side of the curve. By trial
and error, we find that Y=X0.5 has about an equal number of data points
on either side. This curve is shown in Figure 18.3.
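A few lines of code make the conversion and the trial fit concrete. The following Python sketch uses made-up (x, y) pairs purely for illustration – they are not Lijphart's values – and merely shows the mechanics of rescaling and of checking on which side of Y=X0.5 each point falls.

    # Minimal sketch: rescale bicameralism data to the 0-1 box and test Y = X**0.5.
    # The sample (x, y) pairs below are hypothetical, not Lijphart's actual values.
    sample = [(1, 1), (3, 2), (5, 4), (2, 1), (4, 3)]

    def to_unit(v, lo, hi):
        """Convert v from the range [lo, hi] to the range [0, 1]."""
        return (v - lo) / (hi - lo)

    k = 0.5
    for x, y in sample:
        X = to_unit(x, 1, 5)          # degree of federalism, 1..5 -> 0..1
        Y = to_unit(y, 1, 4)          # bicameralism, 1..4 -> 0..1
        predicted = X ** k            # Y = X^k with k = 0.5
        side = "above" if Y > predicted else "below or on"
        print(f"x={x}, y={y}: Y={Y:.2f}, X^0.5={predicted:.2f} -> point lies {side} the curve")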
Do we now have a logically grounded model? This is hardly so.
Conceptually, the anchor points aren’t strongly imposed, and some data
points deviate markedly from the central curve. For X=0, we often have
Y>>0. Conversely, for Y=0, several cases with X>0 also occur. Still,
(0,0) and (1,1) are the most heavily populated points in their neighborhoods. The curve Y=X0.5 acknowledges this, while the linear regression lines do not – not even the symmetric one (not shown in Figure).
So Y=X0.5 has advantages in representing the broad trend. Compared to
linear regression equations, it has vastly more chances of eventually
finding a theoretical justification or explanation.
In all these graphs, we were dealing with subjective estimates rather
than objective measurements. It’s better than nothing, although we feel
on firmer grounds with measurements. All these examples have conceptual constraints on all 4 sides. This makes a fit with the format Y=Xk
conceivable, but it clearly does not work in Figures 18.1 and 18.2. Not all
relationships are linear, nor do they all follow Y=Xk, even when the
allowed area is a box.
Conversion to scale 0 to 1
This is something we always have to do when the logical model or format
presumes a scale from 0 to 1, but the data uses some other scale. In the last
example, we had to convert the ranges 1 to 5 for x and 1 to 4 on y to ranges
from 0 to 1, so as to be able to apply the simple format Y=Xk. We reasoned it
through, using these particular ranges. The need to convert to scale 0 to 1
occurs so frequently, that a general conversion formula might be useful.
Suppose the original conceptually possible range of x goes from m to a
larger number M. Those numbers can be negative or positive. First, we pull
the lower limit down, from m to 0, by subtracting m from all values of x. We
obtain x-m. The total span of possible values is M-m. So we must divide the
values x-m by this span. The result is
X=(x-m)/(M-m).
Check that the lowest possible value, x=m, leads to X=0, and the
highest possible value, x=M, leads to X=1.
When m and M are positive and M is vastly larger than m (M>>m), then
it might make sense to use logx rather than x in the conversion:
X=(logx-logm)/(logM-logm)=log(x/m)/log(M/m).
An example will be given in Figure 25.6.
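For readers who do such conversions in a scripting language rather than by hand, here is a minimal Python sketch of the formula, including the logarithmic variant; the function name and the test values are mine, chosen to match the federalism scale 1 to 5 used above.

    import math

    def to_unit_scale(x, m, M, logarithmic=False):
        """Convert x from the conceptual range [m, M] to the range [0, 1].
        Use logarithmic=True when m and M are positive and M >> m,
        so that X = log(x/m) / log(M/m)."""
        if logarithmic:
            return math.log(x / m) / math.log(M / m)
        return (x - m) / (M - m)

    # Check with the example from the text: federalism scale 1 to 5.
    print(to_unit_scale(1, 1, 5))   # 0.0  (lower anchor)
    print(to_unit_scale(5, 1, 5))   # 1.0  (upper anchor)
    print(to_unit_scale(3, 1, 5))   # 0.5  (midpoint)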
* Exercise 18.1
A frequent scale goes from -10 to +10 (e.g. for left-right placements).
Convert it to scale 0 to 1. This means simply plugging these extreme
values into the equation above. Check the result by feeding into the
equation the values x=-10, 0 and +10. No need to calculate X for other
values of x!
* Exercise 18.2
Using a scale 1 to 10 can be awfully confusing, because it is so close to
0 to 10. Respondents may place themselves at 5 when they wish to
place themselves at the center.
a) What is the actual center on a scale 1 to 10?
b) Convert the scale 1-to-10 to scale 0-to-1. This means simply
plugging these extreme values into the equation above. No need
to calculate X for specific values of x!
c) Use the resulting equation to convert the value 5 on the scale
1-to-10 to scale 0-to-1.
d) Have I cured you of ever using scales 1-to-10 or 1-to-5, in
preference to 0-to-10 or 0-to-5?
Exercise 18.3
a) In Figure 18.1, sketch in a “toppled S” shaped curve, joining
(1,1) and (4,4), but almost horizontal as it passes through the
center of gravity.
b) Comment on its degree of fit to the data, compared to linear
regression lines.
Exercise 18.4
a) In Figure 18.2, sketch in a “toppled S” shaped curve, trying to
fit the data points as well as possible. This means having a
roughly equal number of data points on both sides of the curve,
both at low and high levels of federalism. It’s not easy.
b) Comment on its degree of fit to the data, compared to linear
regression lines.
19. The Importance of Slopes in Model Building
• The direction and steepness of slopes of curves y=f(x) are expressed as dy/dx. The slope dy/dx is often designated briefly as y’ when it does not lead to confusion.
• For exponential growth S=S0ekt, slope is proportional to size: dS/dt=kS.
• For simple logistic growth S=M/(1+e-kt), slope is proportional to size and to closeness to ceiling: dS/dt=kS(1-S/M).
• For fixed exponent function y=Axk, the slope is dy/dx=kAxk-1=k(y/x).
• The cube root law of assembly sizes is an example of use of slopes in constructing a logical model.
• When functions are added, their slopes add: y=y1+y2 → dy/dx=dy1/dx+dy2/dx, or more briefly: y’=y1’+y2’.
• When functions are multiplied together, the resulting slope cross-multiplies slopes and functions: P=xy → dP/dt=y(dx/dt)+x(dy/dt), or P’=yx’+xy’.
• When two functions are divided, it becomes more complex: Q=x/y → dQ/dt=[y(dx/dt)-x(dy/dt)]/y2, or Q’=[yx’-xy’]/y2.
Curves have slopes. For straight lines the slope is constant, the “b” in
y=a+bx. For curves, slope steadily changes. For the curve Y=X0.5 in
Figure 18.3, the slope is steep at low X, but much shallower at high X.
Why bother about slopes, when building and testing logical models?
There are two major ways to make use of slopes:
• Finding minimum or maximum values for models that do not involve slopes, and
• Building models on the basis of slopes themselves.
Examples of both uses will be given. Until then, the sentences above
may have little meaning for you, but believe me: Slopes offer
unbelievably powerful means for model building. First, we have to
introduce some properties of slopes.
Notation for slopes
The steepness of a straight line matters, but at least this slope is
constant, the “b” in y=a+bx. We determined it in Figure 2.1 by dividing
the change in y by the change in x:
slope = (change in y)/(change in x) = Δy/Δx.
For curves, slope steadily changes, but the expression above still applies
approximately, when the increments Δx (delta-x) and Δy are small. The
slope of a curve at a given point is the slope of the line that barely
touches the curve at this point but does not cross it – the tangent to the
curve. Suppose a function y=f(x) is graphed, y vs. x, as in Figure 19.1.
The slope at first gets steeper, but then begins to decrease until, at the
peak, slope becomes zero: While x changes, y does not change. Thereafter, the slope becomes negative, as the curve goes downhill: while x
increases, y decreases.
How do we measure the slope at a point (x,y) on a curve? Advance x
by an increment Δx, but make it a vanishingly tiny amount; it is then
designated as dx. Then y will change by a vanishingly tiny amount,
positive or negative, designated as dy. The slope is now expressed more
precisely:
slope = (tiny change in y)/(tiny change in x)
= dy/dx.
When do we use Δx and when dx? When the difference in x is appreciable, Δx is used. When this difference is made ever smaller (Δx→0) and
becomes “infinitesimally small”, then we use dx. To repeat:
The slope of a curve, at the given value of x, is dy/dx, the ratio
of a tiny change (dy) in y which takes place over a tiny change
(dx) in x.
Figure 19.1. Slope dy/dx is at first positive, then 0 at the peak, then
negative.
Examples, for Figure 19.1:
• When dy/dx=0, y does not change with x – the curve is horizontal, at the peak.
• When dy/dx=+10, the curve goes steeply up, as it does for a while, prior to the peak.
• When dy/dx=-0.5, the curve goes moderately down, as it does at the right end.
Figure 19.2. Parabola y=x2 and its slope dy/dx=2x.
Equation for the slope of a parabola – and for Y=Xk
Just like the curve itself has an equation, so has its slope. For the line
y=a+bx, it is just dy/dx=b. For the parabola y=x2, the slope is dy/dx=2x.
How did I obtain this expression? We’ll see that later on. For the
moment, simply check on whether this equation makes sense. Does it
yield credible results? Look at Figure 19.2. (The 0-to-1 square, which
we have been using when thinking inside the box, would correspond to
the small rectangle outlined in bold, center bottom.)
• At x=0, dy/dx=2x yields dy/dx=0, which certainly is the case: The curve is horizontal at this point – its slope is zero, indeed.
• At x=2, dy/dx=2x yields dy/dx=4. On the graph, as x increases by ½ units, y increases by 2 units, so 2/(1/2)=4, indeed.
• At x=3, dy/dx=2x yields dy/dx=6. On the graph, as x increases by 1/3 units, y increases by 2 units, so 2/(1/3)=6.
• At x=-2, dy/dx=2x yields dy/dx=-4. On the graph, as x increases by ½ units, y decreases by 2 units, so -2/(1/2)=-4.
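The same check can be done numerically: estimate the slope of y=x2 from a tiny increment Δx and compare it with 2x. A minimal Python sketch (the step size 1e-6 is an arbitrary stand-in for “vanishingly tiny”):

    # Approximate dy/dx for y = x**2 by a tiny increment and compare with 2x.
    def f(x):
        return x ** 2

    dx = 1e-6                      # a "vanishingly tiny" increment
    for x in (0, 2, 3, -2):
        slope_numeric = (f(x + dx) - f(x)) / dx   # (tiny change in y)/(tiny change in x)
        slope_formula = 2 * x
        print(f"x={x:>3}: numeric {slope_numeric:.4f}  vs  2x = {slope_formula}")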
Why bother about the parabola? It is an especially simple case of the
model Y=Xk, which we have been using so often. It will be seen that its
slope is dY/dX=kXk-1:
Y=Xk → dY/dX = kXk-1.
Exercise 19.1
Using Figure 19.2, we could verify that dy/dx=2x really produces
reasonable-looking slopes for the curve y=x2. Let us check if
dY/dX=kXk-1 yields reasonable-looking outcomes for simple values of
constant k in Y=Xk.
a) If k=1, what are Y=Xk and dY/dX=kXk-1 reduced to?
Does this make sense?
b) If k=2, what are Y=Xk and dY/dX=kXk-1 reduced to?
Does this make sense?
Two ways to use slopes in model building were mentioned:
• Finding minimum or maximum values for models that do not involve slopes; and
• Building models on the basis of slopes themselves.
An example for each use will now be given.
Cube root law of assembly sizes: Minimizing communication load
The sizes (S, the number of representatives) of legislative assemblies
empirically tend to follow a cube root pattern: S=(2P)1/3, where P is the
adult literate population of the country. Why is this so? It is a matter of
keeping the number of communication channels as low as possible.
As S increases, the burden of communication channels on a single
representative goes down in her district but up in the assembly. Indeed,
the number of constituents a single representative must keep happy is P/S,
and this goes down as S increases. But the number of communication
channels she must monitor in the assembly is S2/2 (recall Chapter 6), and
this goes up. It can be shown (Taagepera 2007: 199) that total number of
channels (c) is close to c=2P/S+ S2/2. When we graph c against S (Figure
19.3), we find that, as S increases, c first goes down, but then begins to
increase again. The optimal size is the one where c is lowest – this is the
size that minimizes the communication load.
Figure 19.3. The burden of communication channels on one representative. Assembly size Soptimal=(2P)1/3 minimizes the number of channels.
c=2P/S+ S2/2.
How do we determine this optimal assembly size? We could draw the
curves c vs. S for each population size and look at which S we get the
lowest c. We can make it easier for us by observing that minimal c
corresponds to the location where the curve is horizontal, meaning
dc/dS=0. Given the equation for the curve, c=2P/S+ S2/2, we can
calculate the equation for its slope, using rules to be given soon. The
result is dc/dS=-2P/S2+S.
This equation gives us the slope at any assembly size – anywhere on
the curve. But now comes the clincher: We also require that dc/dS=0,
because this is the slope at the bottom of the curve. The result is
-2P/S2+S=0. Rearranging leads to S=(2P)1/3. Having a model to support
empirical observation, this relationship now qualifies as a law in the
scientific sense – the cube root law of assembly sizes:
S=(2P)1/3.
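This can also be verified by brute force: for a given population, compute c=2P/S+S2/2 over a range of assembly sizes and locate the minimum. The Python sketch below uses a purely illustrative population of one million literate adults.

    # Find the assembly size S that minimizes c = 2P/S + S**2/2
    # and compare it with the cube root law S = (2P)**(1/3).  P is illustrative.
    P = 1_000_000                                  # hypothetical adult literate population

    def channels(S, P):
        return 2 * P / S + S ** 2 / 2

    best_S = min(range(1, 2001), key=lambda S: channels(S, P))
    print("numerical minimum: S =", best_S)
    print("cube root law:     S =", round((2 * P) ** (1 / 3), 1))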
In this example the model as such did not include slopes. Calculating
the slope and then requiring it to be zero was just a device to locate the
optimal value. Here we tried to minimize the output. Sometimes, to the
contrary, we try to maximize some output. Both minima and maxima
share the same feature: the slope is flat – dy/dx=0. Quite frequently,
however, the logical model itself is built in terms of slopes. Exponential
growth is an important example.
Exponential growth as an ignorance-based model:
Slope proportional to size
Suppose we build a wall, using a steady number of masons who produce k meters of wall per day. Starting from the day construction
began, how does the size (length S) of the wall increase over time (t)? It
increases linearly: S=kt. The rate of increase is the slope of this straight
line:
dS/dt=k.
[wall]
The slope is a constant, as long as the number of masons is not altered
from the outside.
Now consider the way a microbe colony grows when food and space
are plentiful. At regular intervals, each microbe splits into two. Starting
with a single one, we have 1, 2, 4, 8, 16... Hence the growth rate is
proportional to the number of microbes. When their number doubles,
the rate of growth also doubles. For the size of the colony, this means
that here we have the slope dS/dt proportional to S itself:
dS/dt=kS.
[microbe colony]
Compare this equation to the one above. When something is built from
the outside, at a steady rate, dS/dt=k applies. In contrast, when something is building itself from the inside, at a steady rate, dS/dt=kS applies.
Such equations, which include a slope dy/dx, are called differential
equations. They often express the deep meaning of a process. These
differential equations, however, do not enable us to predict how much
stuff there is, at a given time. For this purpose we must use their
“integrated” form. For dS/dt=k, the integrated form is simply where we
started from: S=kt. For dS/dt=kS, it can be shown that the integrated
form is S=S0ekt – an exponential equation. Here S0 is the number of
microbes at time chosen as the starting point, t=0.
This S=S0ekt is also written as S=S0exp(kt). More generally, when the
size is S0 at time t0 (different from t=0), then
S= S0exp[k(t-t0)].
(The format “exp” helps us avoid having subscripts within exponents!)
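The claim that dS/dt=kS integrates to S=S0ekt can be made plausible numerically: step the differential equation forward in tiny time increments and compare with the exponential formula. A minimal Python sketch, with arbitrary illustrative values of S0 and k:

    import math

    # Step dS/dt = k*S forward in tiny time increments (Euler's method)
    # and compare the result with the integrated form S = S0 * exp(k*t).
    S0, k, t_end, dt = 1.0, 0.5, 4.0, 0.0001      # arbitrary illustrative values

    S, t = S0, 0.0
    while t < t_end:
        S += k * S * dt          # dS = k*S*dt
        t += dt

    print("stepwise integration:", round(S, 3))
    print("S0*exp(k*t):        ", round(S0 * math.exp(k * t_end), 3))

The two printed numbers come out very close, which is what the integrated form asserts.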
Now consider relative growth rate. This means growth rate divided
by the existing size: (dS/dt)/S. When expressed as percent of existing
size, it is also called the percent growth rate. For exponential growth,
dS/dt=kS, this relative growth rate is constant:
(dS/dt)/S=k.
[exponential growth]
This is what makes exponential growth or decrease so widespread and
basic. It’s the pattern of growth that takes place as long as relative
growth rate does not change. A larger k means steeper growth.
Let us now ask the following question. We are told that the size of
something is changing over time. We are asked to make our best guess
on whether its rate of change is increasing or decreasing. What would
be your best guess? Write it down.
In the absence of any other information, we have no grounds to
prefer increase to decrease, or vice versa. So our best guess is that
growth rate remains the same. If we have the further information that
this something is being constructed from the outside, our best guess is
constant absolute rate of change: dS/dt=k. But if we have the further
information that this something constructs itself, then our best guess is
rate of change proportional to size: dS/dt=kS, so that the relative rate of
change is constant.
Hence the exponential model is another example of an ignorance-based model. It is one of the most important and universal models of
that type. And it uses the notion of slopes.
Simple logistic model: Stunted exponential
Nothing continues to grow forever. Growth that starts exponentially
sooner or later levels off, reaching a maximum size (M). The simplest
way to impose a limit on something which begins as dS/dt=kS is to
multiply by (1-S/M):
dS/dt=kS(1-S/M).
When S is much smaller than M, (1-S/M)≈1, so we have exponential
growth dS/dt=kS. But when S reaches one-half of M, dS/dt=kS/2. This
means we are down to one-half of the exponential slope. When S
approaches M, (1-S/M) approaches 0, so the slope also approaches 0 –
growth stops. The resulting curve is the simple logistic curve – see
Figure 19.4, which is the same as Figure 14.4. A larger k means steeper
growth from near-zero to near-ceiling.
Figure 19.4. Exponential curve and simple logistic curve which starts
out the same way.
Integration of dS/dt=kS(1-S/M) yields
S=M/(1+e-kt)
when t is counted from the time when S=M/2. The proof is not shown
here, but check whether this equation makes sense at some simple
values of time, like zero and minus and plus infinity.
When t=0, e-kt = e0 = 1, so S=M/2, indeed.
When t→+∞, e-kt = 1/ekt → 0, so S→M, as it should.
When t→-∞, e-kt → ∞, so S→M/∞ = 0, as it should.
EXTRA
Recall from Chapter 14 that rearranging S=M/(1+e-kt) leads to S/(M-S)=
ekt – the ratio of distances from S to the floor and to the ceiling grows
exponentially as (M-S) shrinks ever more to nothing. Up to now we
counted time from the time when S=M/2. More generally, when the size
is y0 at time t0, then
y= M/{1 + [(M-y0)/y0]exp[-k(t-t0)]},
It follows that
y/(M-y) = exp[k(t-t0)]
and hence
log[y/(M-y)] = kt-kt0.
To test whether some data fits the simple logistic model, graph
log[y/(M-y)] vs. t – a straight line should result, with slope k. This is the
basis of so-called Logit data fits in statistics.
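As a sketch of how such a Logit check might look in practice, the following Python lines simulate data from a logistic curve with a known ceiling M and tabulate log[S/(M-S)] against t; with the natural logarithm, successive values should increase by exactly k per unit of time. The numbers are invented for illustration.

    import math

    # If data follow the simple logistic S = M/(1 + exp(-k*t)),
    # then log[S/(M-S)] plotted against t is a straight line with slope k.
    M, k = 100.0, 0.8                                   # illustrative ceiling and rate
    times = [-3, -2, -1, 0, 1, 2, 3]
    data = [M / (1 + math.exp(-k * t)) for t in times]  # simulated "observations"

    # Natural log, so successive differences of the logit should all equal k.
    logits = [math.log(s / (M - s)) for s in data]
    for t, s, L in zip(times, data, logits):
        print(f"t={t:>2}  S={s:6.2f}  log[S/(M-S)]={L:6.3f}")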
How slopes combine – evidence for the slope of y=xk
It can be seen that slopes are important building blocks and tools for
logical models. The following introduces some basic relationships. I’ll
try to show why and how they make sense. You’ll begin understanding
more as you make use of these formulas. The slope dy/dx is often
designated briefly as y’ when it does not lead to confusion.
When we add two functions, their slopes add:
y=y1+y2 → dy/dx = dy1/dx + dy2/dx, or more briefly: y’=y1’+y2’.
Thus, when we add a straight line y1=a+bx and a parabola y2=x2, the
slope of y=y1+y2 is dy/dx=b+2x. This also implies that multiplying a
function by a constant multiplies its slope by the same constant:
y=ky1 → dy/dx = k(dy1/dx).
The area of a rectangle is the product of its two sides: A=xy. What is the
rate of change of area when the lengths of the sides change over time?
This means, what is the value of dA/dt when the rates of change of x and
y are dx/dt and dy/dt, respectively? I claim that we have to cross-multiply the functions and their slopes:
dA/dt =y(dx/dt )+x(dy/dt).
Why? Look at the rectangle in Figure 19.5, with sides x and y. Its area
is A=xy. Extend both sides by a bit, Δx and Δy, respectively. The area
increases by ΔA=xΔy+ yΔx+ΔxΔy, but we can neglect the tiny corner Δx
Δy. This corner becomes very small indeed, compared to xΔy or yΔx,
when we make Δx and Δy extremely small (dx and dy). Divide
dA=xdy+ydx by dt, and we get the rate of change over time.
In sum, when functions are multiplied together, the resulting slope
cross-multiplies functions and their slopes:
P=xy → dP/dt =y(dx/dt)+x(dy/dt), or more briefly: P’=yx’+xy’.
Figure 19.5. Area increase results from cross-multiplication of lengths
and length increases. The corner Δx Δy becomes negligible when Δx and
Δy become infinitesimally small.
Now we can prove that the slope of the parabola y=x2 is dy/dx=2x. In
Figure 19.5, make y equal to x, so that we have a square with area A=x2.
Extend both sides by Δx. The area increases by ΔA≈xΔx+xΔx=2xΔx
when we neglect ΔxΔx. Hence ΔA/Δx≈2x. Using y instead of A, and
going from Δx to tiny dx, the result is dy/dx=2x.
Try the same approach for a cube of volume V=x3. If you do, you’ll
find that it leads to dV/dx=3x2. We have multiplied by 3 and reduced x3
to x3-1. We can generalize:
y=xk → dy/dx = kxk-1.
Maybe surprisingly, this rule applies even to fractional values of k. It
applies to negative k too. For y=1/x=x-1, dy/dx=(-1)x-2=-1/x2. In short,
y=1/x → dy/dx=-1/x2.
The negative sign makes sense: as x increases, y decreases – the slope is
negative. It is harder to visualize that the slope is sort of “stronger” than
the curve itself. Pick simple examples and check that this is so indeed.
Slopes are more complex when division is involved. For instance,
per capita GDP (Q) is obtained by dividing the total GDP (G) by
population (P): Q =G/P. Suppose both G and P change over time. What
is the rate of change of Q? It is dQ/dt=[P(dG/dt)-G(dP/dt)]/P2.
How is this result obtained? I do not expect you to go through the proof.
But just in case, here it is. The proof combines the outcomes for y=1/x
and A=xy. Express Q=G/P as a product: Q=(1/P)G. By the rule of
products, dQ/dt=(1/P)(dG/dt)+G[d(1/P)/dt]. Plugging in d(1/P)/dt=
(-1/P2)(dP/dt), we get dQ/dt=(dG/dt)/P+G[(-1/P2)(dP/dt)]. Rearranging:
dQ/dt=[P(dG/dt)-G(dP/dt)]/P2.
In a more general notation, Q=x/y → dQ/dt=[y(dx/dt)-x(dy/dt)]/y2,
or more briefly: Q’=[yx’-xy’]/y2.
We may not need all these formulas in the examples that follow, but
these are the basic combinations. Some of them are likely to puzzle you
at first. They will become clearer with use. It helps if you give yourself
simple examples where you can figure out the result by other means –
and discover that the formula yields the same result.
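Here is one such simple example, worked in Python: take two arbitrary functions of t, advance t by a tiny step, and compare the directly measured changes of their product and ratio with what the formulas predict.

    # Check P' = y*x' + x*y' and Q' = (y*x' - x*y')/y**2 numerically,
    # using the arbitrary functions x(t) = t**2 and y(t) = 3*t + 1.
    def x(t): return t ** 2
    def y(t): return 3 * t + 1

    t, dt = 2.0, 1e-6
    xp = (x(t + dt) - x(t)) / dt          # x'  (about 2t = 4)
    yp = (y(t + dt) - y(t)) / dt          # y'  (about 3)

    P_direct = (x(t + dt) * y(t + dt) - x(t) * y(t)) / dt
    P_rule = y(t) * xp + x(t) * yp
    Q_direct = (x(t + dt) / y(t + dt) - x(t) / y(t)) / dt
    Q_rule = (y(t) * xp - x(t) * yp) / y(t) ** 2

    print("product rule :", round(P_direct, 3), "vs", round(P_rule, 3))
    print("quotient rule:", round(Q_direct, 3), "vs", round(Q_rule, 3))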
Equations and constraints for some basic models
Table 19.1 shows the differential and integrated equations for some
basic models, along with the constraints on variables that impose such
equations. The differential and integrated equations are fully equivalent
expressions for the same model, but they may look quite different, and
they serve different purposes. The differential equation often expresses
and explains better the mechanics of the process, while the integrated
equation enables us to predict the output from the input.
Table 19.1. Differential and integrated equations for some basic
models, and related constraints.
_____________________________________________________________________
Type of model               Differential equation → Integrated equation   Corresponding constraints

Processes in time
 External construction      dS/dt=k            →  S=kt                    Only two quadrants allowed
 Self-propelled growth:
   exponential              dS/dt=kS           →  S=S0ekt                 Only two quadrants allowed
 Exponential with ceiling:
   simple logistic          dS/dt=kS(1-S/M)    →  S=M/(1+e-kt)            Only a zone in two quadrants allowed

No process in time
                            dy/dx=k(y/x)       →  y=Axk                   Only one quadrant allowed
                            dy/dx=k(y/x)       →  y=xk                    Box in one quadrant
                            dy/dx messy        →  y=xk/[xk+(1-x)k]        Box in one quadrant, with central anchor point
_____________________________________________________________________
Starting from a very different angle, we are back to some equations that
emerged in Chapter 14. In the case of processes in time, the differential
equations are remarkably simple and express well the fundamental
underlying assumptions. In contrast, the integrated equations use more
complex functions. Yet only the integrated equations are “explicit” in
the sense of enabling us to calculate y from x.
For the fixed exponent equation, the reverse applies. Here the integrated form is simpler:
y=Axk → dy/dx=kAxk-1.
The differential form enables us to calculate the slope at any value of x.
Also note that Axk-1=Axk/x, and Axk=y. Substituting y for Axk yields Axk-1
= y/x. Therefore, dy/dx=kAxk-1=ky/x. It turns out that here the slope is proportional to the
ratio of y and x:
y=Axk → dy/dx = k(y/x).
This form has a beautiful symmetry, but is not a very practical form. Could this pattern
correspond to another growth pattern in time? No, it cannot deal with time, because time
extends from minus to plus infinity, while here x is limited to positive values.
When the box in one quadrant also has a central anchor point, the
differential form is even more complex than the integrated one, and it
adds little to our ability to visualize the pattern.
* Exercise 19.2
The previous model for cabinet duration in years, C=42/N2, has the form
Y=AXk. Keep in mind that 1/N2=N-2.
a) What would be the expression for dC/dN? [Use dy/dx=kAxk-1,
NOT dy/dx = k(y/x).]
b) Do such slopes agree with the graph in Exercise 2.1? To
answer this, insert into the equation for dC/dN simple values
like N=1, then N=5. Add lines with these slopes to your graph
in Exercise 2.1, at N=1 and N=5. Do they look tangent to the
curve C vs. N?
c) Why doesn’t this result apply to Figure 12.1? (HINT: Distinguish between x and logx.)
D. Further Examples and Tools
20. Interest Pluralism and the Number of Parties: Exponential Fit
• The exponential function y=Aekx is a most prevalent function in natural and social phenomena. This is a basic pattern where the curve approaches a floor or a ceiling without ever reaching it.
• The coordinates of two typical points on the curve, plugged into y=Aekx, enable us to determine the numerical values of k and A.
• If we want to have y=A when x=B, use format y=Aek(x-B). To determine k, plug the coordinates of another typical point on the curve into k=[ln(y/A)]/(x-B).
• If the exponential format applies, the data cloud is linear when graphed on semilog paper.
We are back to Arend Lijphart’s Patterns of Democracy (1999). A
central measure for the majoritarian-consensus continuum of democratic
institutions is the effective number of parties in the legislative assembly.
This number has a definite lower limit at N=1, but no clear-cut upper
limit. True, an assembly of S seats could fit at most S parties, but S itself
varies, and actual values of N are so much smaller (N<<S) that we
might as well say that N has no finite upper limit.
Some countries have many separate interest groups, while in some
others interest groups are organized under a few roof organizations,
such as those representing all labor and all employers. Lijphart makes
use of a compound index (I) that ranges from 0 to at most 4. How it is
formed need not concern us here. Maybe surprisingly, countries that
have many parties tend to have low interest group pluralism, and vice
versa, as if there were some complementarity. But what is the form of
this relationship? This may give us some clue on why the inverse
relationship exists in the first place. It turns out that I might decrease
exponentially with increasing N. So this offers an example of an exponential relationship where no growth in time is involved.
Interest group pluralism
Figure 20.1 graphs interest group pluralism (I) versus N. At the top, we
see the original graph (Lijphart 1999: 183). It shows an OLS line (I on
N) in the midst of such high scatter (R2=0.30) that much of the entire
field is filled with points. At low N, most of the data points are above
the OLS line, while at high N, most points are below this line. So this
20. Interest Pluralism and the Number of Parties: Exponential Fit
does not look like a good fit. How is this possible? Two outliers, ITA
and PNG, top right, pull this side of the OLS line up, while having little
impact on the center of gravity. So the left side of the line gyrates lower,
the more so, because the outlier AUT pulls down. (NOR and SWE are
almost at mean N, and hence have little impact on slope.) We must look
for a better fitting format.
Figure 20.1. Interest group pluralism vs. effective number of parties:
Original graph (Lijphart 1999: 183) and addition of an open box and an
exponential fit.
Note: SOME SCANNED-IN FIGURES MAY NOT COPY IN
NORTH AMERICA.
Delineating the allowed region (bottom graph) changes little. The
original limits shown on the left, top and bottom are indeed the
conceptual limits. It’s just that the box is wide open to the right.
Drawing in the other OLS line (N on I) does not help – it errs in the
opposite direction, compared to the first OLS line. Moreover, it
predicts forbidden values of I at low N, and at high N. So does the first
OLS line, for high N: For N>8, I will be negative. The symmetric
regression line (not shown) will have the same problem. We must look
for a better format than linear.
We have no data points at N=1. A democracy with only one party is
rare, to put it mildly, although Botswana (BOT) comes close, with
N=1.35 (one huge and one tiny tribal party). However, the general trend
seen in Figure 20.1 suggests that if N=1 ever materialized, then I might
be close to its upper limit of 4. No other value is more likely, so let us
take (1,4) as a tentative anchor point, which a reasonable data fit should
respect.
In addition to the anchor point at (1,4), we also have a conceptual
bottom at I=0. Data points may approach it, at high N, but cannot drop
beneath. So we must have a curve starting at (1,4) and gradually
approaching I=0 at large N, without ever reaching it. As seen in Chapter
15, the simplest format in such cases is an exponential function: y=Aekx.
The constant k indicates how rapidly y changes with increasing x. A
positive value of k means increasing y. A negative value of k means a
decreasing y – this is the case here.
We replace y with I, and x with N, so the general format is I=AekN. It
will soon be shown that the specific form is
I = 4e-0.31(N-1),
as shown in the bottom part of Figure 20.1. This data fit also implies
that the rate of change of I with increasing N is
dI/dN=-0.31I,
and the relative rate of change is constant:
(dI/dN)/I=-0.31.
But let us first check whether this equation fits the data. At low N, this
curve passes below most data points, so the fit is not much better there
than it is for the OLS line I-on-N. Indeed, a simple straight line passed
through (1,4) might seem to fit the data cloud better, and the symmetric
regression line (not shown) would come close. Trouble is, such a line
would predict a negative value of I for N larger than 6. Our data fits
should not predict absurdities, even though such high values of N do not
occur in that particular data set. So the exponential fit still is preferable.
It fits most points at high N, apart from those blatant deviants, ITA and
PNG.
Have we encountered a similar pattern earlier? Think. No, do not ask
if we have dealt with interest groups earlier. Think in terms of similarities in allowed areas and anchor points. What do we have here? We
have conceptual limits on 3 sides only. We have an anchor point at one
end, plus a gradual approach to conceptual limit at the other end. Where
have we met it before? Volatility. Its refined model (Figures 11.2 and
15.6) is the mirror image of the present one: approach to a ceiling
instead of a floor. Here we have the broad form y=Aekx. For volatility, it
is y=1-Aekx. It is extremely important to recognize such broad similarities.
Fitting an exponential curve to interest pluralism data
How can we determine a fit like I = 4e-0.31(N-1)? We can start in two
ways: the basic exponential format y=Aekx, or a format which looks
slightly more complex: y=Aek(x-B). We’ll do it both ways – and what
looks more complex turns out simpler. The starting point is the one
suggested in Chapter 14: pick two “typical” points and fit the equation
to them. One of those points should be the anchor point (1,4). Then pick
a data point at a moderately large N that seems to be close to the central
trend. In Figure 20.1 something close to VEN (3.4,1.9) might do.
APPROACH I. The format y=Aekx means here I=AekN. Plugging in the
coordinates of one typical point, then the other, yields two separate
equations in k:
4=Ae1k and 1.9=Ae3.4k.
Dividing member by member cancels out A:
1.9/4=e3.4k/e1k.
Hence 0.475= e(3.4-1)k=e2.4k. Take logarithms – and here it pays to push
the lnx button rather than logx, because ln(ex)=x by definition. So here
ln0.475=2.4k. Hence
k=ln0.475/2.4=-0.744/2.4=-0.3102,
which we round off to -0.31. The value is negative, because the curve
goes down.
To find A, we plug k=-0.31 into one of the initial equations.
Actually, I often do the calculations with both initial equations, to guard
against mistakes. Here
4=Ae1×(-0.31) and 1.9=Ae3.4×(-0.31)= Ae-1.054, so that
A=4e0.31 and A=1.9e1.054.
(Note that when the exponential moves to the other side of the equation,
its exponent changes sign!) By using the ex or 2ndF/lnx button,
A=4e+0.31=4×1.363=5.454 and A=1.9e+1.054=1.9×2.869=5.451.
The difference is due to rounding off k, and A=5.45 is sufficiently
precise. The final result is
I=5.45e-0.31N.
This equation can be used to calculate points on the curve shown in the
bottom Figure 20.1. Trouble is, the anchor point (1,4) does not stand out
in this equation. We can easily see what the value of I would be at
N=0 – it would be 5.45 – but this is pointless information when N≥1 is
imposed. This is where the second format comes in handy: y=Aek(x-B).
APPROACH II. The format y=Aek(x-B) means here I=Aek(N-B). The
constants A and B go together in the following way: When N=B, then
I=A, given that e0=1. It is convenient to plug in the anchor point
coordinates, A=4 and B=1:
I=4ek(N-1).
Thus, we already have the value of A, and it remains to find k. We again
plug in (3.4,1.9): 1.9=4e2.4k, and again k=-0.31 results. The final result
now is
I=4e-0.31(N-1).
It yields exactly the same values as I=5.45e-0.31N, but it also keeps the
anchor point in full view. And it was easier to calculate – the value of A
was there without any calculations.
More generally, if we want to have y=A when x=B, use format
y=Aek(x-B). To determine k, plug the coordinates of another typical point
on the curve into k=[ln(y/A)]/(x-B).
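The whole two-point fit takes only a few lines of code. The Python sketch below reproduces the calculation for the numbers used above (anchor at N=1, I=4; typical point near N=3.4, I=1.9); the helper function is my own, not a standard routine.

    import math

    def fit_exponential(xB, A, x1, y1):
        """Fit y = A*exp(k*(x - xB)) through the anchor (xB, A) and a typical point (x1, y1)."""
        return math.log(y1 / A) / (x1 - xB)

    k = fit_exponential(1, 4, 3.4, 1.9)       # anchor (1, 4), typical point (3.4, 1.9)
    print("k =", round(k, 3))                 # about -0.31

    # Use the fitted curve I = 4*exp(k*(N-1)) to predict I at a few values of N.
    for N in (1, 2, 4, 6):
        print(N, round(4 * math.exp(k * (N - 1)), 2))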
GRAPHIC APPROACH. We could graph the data on semilog paper,
with I on the logarithmic scale (or graph logI vs. N on regular paper).
The corresponding data are available in Lijphart (1999: 312-313). See if
the data cloud is linear or bent. If it looks sufficiently straight, the
exponential model applies. Starting at the anchor point (1,4), draw in
the visual best-fit line. The slope of the line corresponds to the value of k.
But semilog paper is based on the decimal “log”, while the exponential
formula connects to the natural log, “ln”. The difference may well
confuse you at first, so check your results with one of the other
approaches.
Note that, apart from ITA and PNG on the high side and AUT, SWE
and NOR on the low side, all data points are below the curve I=8e-0.31(N-1)
but above the curve I=2e-0.31(N-1). (See Exercise 20.1.) This
means that if we graph logI vs. N, the data cloud would become a
straight zone, like the one in Figure 15.7, with zone limits at one-half
and double the expected value.
The data fit I=4e-0.31(N-1) is approximate. Given the degree of scatter,
a more precise statistical fit would add little to our understanding of the
relationship. But do we have a logical model? It is hard even to argue
that interest group pluralism should decrease with increasing N (which
is a good indicator of consensus-minded politics). It is even harder to
explain why it should decrease exponentially, and at the rate observed
(as reflected in k=-0.31). These are questions we should ask, even if we
do not reach answers.
Exercise 20.1
The curves I=8e-0.31(N-1) and I=2e-0.31(N-1) are not shown in Figure 20.1.
Sketch them in. How? Select a couple of convenient points on the main
curve, and place by eye the points twice that high and one-half that
high. Then join them.
The slope of the exponential function
The simplest exponential function is y=ex. This is what we get when we plug
A=1, k=1 and B=0 into y=Aek(x-B). It has the unique property that, at any
value of x, its slope equals the size itself:
d(ex)/dx= ex.
This is the only mathematical function to have this property. Indeed, if we
require that slope must equal the function, then the function f(x)= ex results –
and this is what defines the numerical value of e=2.71828… in the first place.
This is the case when k=1. With any other value of k, the slope is proportional to
the size itself:
d(Aek(x-B))/dx=kAek(x-B).
This is why the curve approaches zero without ever reaching it, when k is
negative: As y becomes small, so does its further decrease.
Whenever something grows or decreases proportional to its existing size,
it follows the exponential pattern. Growth of bacterial colonies or human
populations and growth of capital at fixed interest follow this pattern, with a
positive k. The amount of radioactive material decreases according to this
pattern, with a negative k. Indeed, the exponential function pops up
whenever some variable x can take any values, from minus to plus infinity,
while the connected variable y can take only positive values, without any
further restrictions. This makes the exponential a most prevalent function in
natural and social phenomena, on a par with the linear y=a+bx and the fixed
exponent y=Axk that can be reduced to Y=Xk.
Exercise 20.2
At the ECPR (European Consortium for Political Research) Workshops
in Rennes 2008, Rense from Nijmegen told me the following. They help
the unemployed to write better CVs and hence be more successful in
seeking jobs. They think they do something socially useful, but they are
told that this isn’t so. They do not affect the number of jobs available. If
they help one person to get a job, then simply someone else fails to get
this job. They supposedly just help one person at the expense of
another, and the social benefit is nil. Rense and his colleagues disagree
but find it hard to respond.
So who is right, or is there a third way out? Does it make sense to
help the unemployed to write better CVs? Close your eyes for the
moment and ponder: How would I tackle this issue? (If you omit this
step you might lose more than you think!) Scribble down a few ideas.
And now proceed to the following.
a) Which variables are involved? OK, the number of positions
filled (F) is one variable. What else? No idea? Well over which
variable does F change?
b) Label this other variable T. Sketch a graph, F vs. T.
c) Introduce a simplifying assumption. Assume that initially all
positions the economy needs (FM) are filled (F=FM) but then a
recession sets in and some workers lose their jobs, so that the
number of filled positions drops to FR. Later, at T=0, the
recession suddenly goes away and FM positions again need to
be filled, but only FR positions are filled. Enter these levels (FM
and FR) on your graph, at T=0. Prior to T=0, F= FR. (Actually,
recessions recede only slowly. But taking this into account would make the
issue pretty complex right at the start. Better assume instantaneous recovery,
for the moment, and leave the more complex situation for later.)
d) Also enter the ceiling, i.e., the level FM toward which the
employers try to raise F during the period T>0. What is the
simplest model to express the notion that “the number of
positions filled tends toward the number of positions available
the faster, the larger the gap between the two”? Offer the format
of the corresponding equation (with no specific numbers), and
sketch the approximate graph.
e) Why aren’t the vacant positions filled instantaneously? It’s
because all employers and jobseekers suitable for each other do
not meet. The mutual fitting together takes time. How would
this time change for those who present themselves with
improved CVs? Enter (always on the same graph) the curve for
those with improved CVs.
f) Which curve would tend to be followed immediately after T=0?
To which curve would it shift later on?
g) Then how do those who help to write better CVs alter the
situation? Are they of any help to job seekers as a group? Are
they of any help to the economy as such?
h) Would these conclusions change if we made the description
more realistic by assuming that the job market expands from FR
to FM only gradually?
i) And now to the most important step. Stop for a while and think
back: How would you have tackled this issue on your own,
compared to what I have put you through? Would you agree
with my approach? If you don’t, fine – as long as you take me
as seriously as you take yourself. At the end of the task, just
enter: “Yes, I have pondered it.”
EXTRA I: Why not a fit with fixed exponent format?
The anchor point at (1,4) and approach to a floor at I=0 can also be
fitted with fixed exponent format I=4Nk. Indeed, I=4N-0.60 =4/N0.60 yields
a curve fairly close to that of
I=4e-0.31(N-1). It is slightly lower at N=2 and remains higher at N=6, so
the fit to data is visually only slightly worse. An analogous fit for
volatility was pointed out in Chapter 11, EXTRA. Why should we prefer
an exponential fit?
Suppose we have a positive value at the start, and it decreases
toward a floor. If the anchor point is at x=0, the format y=Axk cannot
work, because at x=0, y is either 0 (when k is positive) or infinity (when
k is negative), but not some intermediary value. We are left with the
exponential option by default.
If the anchor point is at some positive value of x, as is the case in I
vs. N, then both options are possible. The exponential approach still has
the following advantages.
1) Uniformity. To be able to compare, it is better to keep the same
format, regardless of whether the anchor point is at x=0 or x>0.
2) Rate of change is proportional to the existing size, hence
relative (percent) rate of change is constant. This is a powerful
general model of ignorance. The format y=Axk does not offer
anything that simple.
For these reasons, exponential fit is our preferred option. If this fit does
not work, we could consider y=Axk and even other options. In Figure
20.1, for instance, I=10/(N+1.5) yields a curve intermediary between
those of I=4e-0.31(N-1) and I=4/N0.60. Even if it should fit better in a
statistical sense (it doesn’t), it should be avoided, unless its format has
some logical justification.
Figure 20.2. Disproportionality vs. number of parties: Original graph
(Lijphart 1999: 169), and addition of exponential fit with a vague
anchor point.
EXTRA II: Electoral disproportionality
This example, again from Lijphart (1999), is left as an EXTRA, because
the format x=Aeky (rather than the usual y=Aekx) may be a bit confusing
for beginners.
Figure 20.2 deals with electoral disproportionality (D). This is the
difference between vote and seat shares of parties. It ranges in principle
from 0 to 100%. I will not describe how it is determined. While the
previous graph had something else graphed against N, here Lijphart
(1999: 169) prefers the opposite direction, N against D. If one insists on
a linear fit, a low R2=0.25 results. But it’s quite visible that no linear fit
does justice to the data cloud, because the cloud shows a clear
curvature.
The allowed region again has finite limits on 3 sides, even while one
of them isn’t visible in the graph. While N can range upward from 1,
with no clear upper limit, D can range upward from 0, and no higher
than 100 percent. But the situation is more fluid than that. Consider the
situation at N=1, meaning that a single party has won all the seats. What
level of disproportionality could be expected? If it also had all the votes,
D would be zero. At the other extreme, D could be 100% only if that
party obtained no votes at all. It is argued in Taagepera and Shugart
(1989: 109-110) that the average outcome can be expected to be around
D=25%. So we take (N=1, D=25) as an anchor point.
If D were graphed against N (rather than vice versa), the picture
would be quite similar to the previous I vs. N. (You may find it easier to
turn the page by 90 degrees, so that the D axis goes up.) The exponential format D=Aek(N-1) looks possible. Taking N=1, D=25 as an anchor
point and D=0 as the conceptual bottom (which is vertical in the actual
graph!) leads to
D = 25ek(N-1).
Picking NOR as a typical central point leads to k=-0.66 so that
D = 25e-0.66(N-1).
Here this fit is clearly more satisfactory than any linear fit could be.
Note furthermore that, apart from PNG, the highest points (FRA,
IND) are near the curve D=4×25e-0.66(N-1), while the lowest points
(MAL, AUT, GER, SWE) are near the curve D=(25/4)e-0.66(N-1). This
means that if we graph logD vs. N, the data cloud would become a
straight zone, like in Figure 15.7, but the zone limits would be at one-quarter and 4 times the expected value.
How close are we to a logical model? The anchor point is somewhat
fluid. It makes sense that higher N leads to lower D, given that higher N
tends to reflect more proportional electoral rules. Also, as N increases,
its further impact on D should gradually become milder. In this respect
the exponential pattern makes conceptual sense.
21. Moderate Districts, Extreme Representatives: Competing Models
• When both x and y are conceptually limited to the range from 0 to 1, with anchor points (0,0) and (1,1), the simplest fit is with Y=Xk. When a further anchor point is imposed at (0.50,0.50), the simplest fit is with Y=Xk/[Xk+(1-X)k].
• When this third anchor point is shifted away from (0.50,0.50), even the simplest fit becomes more complex.
• Different logical approaches sometimes can be applied to the same problem and may fit data about equally well. We have to ponder which one has a wider scope.
• All other things being equal, a model with no adjustable parameters is preferable to one that has such parameters. A smooth model is preferable to a kinky one. A model that connects with other models is preferable to an isolated model.
We now reach situations with more than two constraints, and the
models become more complex. I’ll explain the resulting equations in
less detail, compared to exponential and fixed exponent equations.
Why? First, we are less likely to encounter these specific forms in
further research. Second, by now we should be able to recognize some
general patterns that repeat themselves, in a more complex form.
Our starting point is a graph in Russell Dalton’s Citizen Politics
(2006: 231), reproduced here as Figure 21.1. It shows the degree of
conservatism of representatives (on y-axis) as compared to the
conservatism of their districts (on x-axis) in US House elections. “There
is a very strong congruence between district and representative opinions
(r=.78) [thus R2=0.61], as one would expect if the democratic process is
functioning” (Dalton 2006: 231). Reporting the value of R implies a
linear fit. This line is not shown, but we can mentally draw it in. It
would predict negative y for x less than 0.10, and y above 1 for x above
0.80. What else can we see in this graph, beyond “When x goes up, y
goes up”? What else should we add to the graph, so as to see more?
Graph more than the data
By now we should know it. Delineate the entire conceptually allowed
area. It both reduces and expands on the field in the original graph, as x
21. Moderate Districts, Extreme Representatives: Competing Models
and y can both range from 0.00 to 1.00. Next, consider conceptual
anchor points. In a 100 percent conservative district, the representative
has a strong incentive to vote 100 percent conservative, while in a 100
percent radical district, the representative has a strong incentive to vote
100 percent radical. No such data points can be seen, but the most
extreme data points do approach the corners (0,0) and (1,1). Any data fit
should connect these anchor points. Also draw the equality line y=x.
Figure 21.2 shows these additions.
A puzzle emerges as soon as we graph the equality line. District and
representative opinions do not match. In conservative districts,
representatives are more conservative than their average constituents,
and in radically non-conservative districts, representatives are more
radical than their average constituents. The representatives tend to be
more extreme than their districts. Why do they do so, and why precisely
to such a degree? Note that we seem to have a third anchor point, halfway up, as briefly introduced in Chapters 9 and 14.
Figure 21.1. Attitudes of representatives and their districts: Original
graph (Dalton 2006:231) – data alone.
Note: SOME SCANNED-IN FIGURES MAY NOT COPY IN
NORTH AMERICA.
How would a representative tend to vote in a 50-50 district? The
situation is analogous to seat distribution when votes split 50-50
(Figure 9.4). In the absence of any further information, we have no
reason to guess at more or less than 0.50. So we would expect the curve
to pass through the point (0.5,0.5). This agrees roughly with the data
cloud.
Note that the data cloud itself appears somewhat different once the
equality line, forbidden areas and anchor points are introduced. The
free-floating bent sausage of Figure 21.1 appears in Figure 21.2 as
visibly squeezed in between the equality and bottom lines, at the left
corner. More diffusely, the same applies at the top right corner. The data
cloud rises sharply around the central anchor point.
Figure 21.2. Original graph (Dalton 2006:231) plus conceptually
allowed region, equality line and 3 anchor points.
A model based on smooth fit of data to anchor points
When X and Y can range from 0 to 1, and three conceptual anchor points
impose themselves – (0,0), (0.5,0.5) and (1,1) – the simplest family of
curves passing through these anchor points is (recall Chapter 9)
Y = Xk/[Xk +(1-X)k].
[3 anchor points, no bias]
Here parameter k can take any positive values. This equation can also be
expressed more symmetrically as
Y/(1-Y) = [X/(1-X)]k.
When k=1, we obtain Y=X. Then the representatives would be as
extreme as their districts. Values of k exceeding 1 lead to curves in the
shape of a “drawn-out S”. Then the representatives are more extreme
than their districts. This is what we observe. (Values of k less than 1 would
lead to “drawn-out S” curves on the opposite sides of the equality line. The
representatives would be more moderate than their districts.)
Figure 21.3. Further addition of a smooth model, based on continuity
between three anchor points, fitted to the data.
The parameter k expresses the steepness of the slope at X=0.5. For the
given data, k=3.0 is close to best fit. Figure 21.3 shows the
corresponding curve. To determine the precise best linear fit, we should
graph log[Y/(1-Y)] against log[X/(1-X)]. But we do not need more
precision. What we need is an explanation for the broad pattern we
observe.
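To see what different values of k do to this family of curves, one can simply tabulate Y=Xk/[Xk+(1-X)k] for a few exponents. A minimal Python sketch:

    # Tabulate Y = X**k / (X**k + (1-X)**k) for a few exponents,
    # to see how k controls the steepness of the "drawn-out S" at X = 0.5.
    def anchored_curve(X, k):
        return X ** k / (X ** k + (1 - X) ** k)

    print("  X    k=1    k=2    k=3")
    for X in (0.1, 0.25, 0.5, 0.75, 0.9):
        row = "  ".join(f"{anchored_curve(X, k):.3f}" for k in (1, 2, 3))
        print(f"{X:4.2f}  {row}")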
We have fitted the data to the simplest mathematical format imposed
by conceptual anchor points. But what is it that imposes k=3 rather than
k=1 (equality line Y=X) or k=2 (a shallower central slope) or k=4 (a still
steeper slope)? In the absence of an answer, let us try a different
approach that may offer some insight.
Figure 21.4. Graph in Figure 21.2, plus a kinky model based on
extremes of representative behavior and their means.
A kinky model based on political polarization
Simplify the issue as much as we can, as a first approximation. Consider
a two-party constellation with a purely Conservative Party and a purely
Radical Party. The US Republican and Democratic Parties are imperfect
approximations to such ideologically pure parties. Under such conditions we may assume that all districts with X>0.5 elect a Conservative
and all those with X<0.5 elect a Radical. Consider the extreme possibilities.
If a society is extremely polarized, all Radical representatives will
take 0% conservative stands (Y=0) while all Conservative representatives will take 100% conservative stands (Y=1). A “step function”
results (Figure 21.4), meaning a sudden jump at 0.5: Y=0 for X<0.5 and
Y=1 for X>0.5. On the other hand, if the society isn’t polarized at all, all
representatives may wish to reflect their constituent mix, leading to the
Y=X line in Figure 21.4.
Figure 21.5. Graph in Figure 21.2, plus combination of two kinky
models, based on different ways to take means.
At the center of the graph, we have then two extremes: the vertical line,
and the line at slope 1. In the absence of any other information, take the
mean of these conceptual extremes at the center of the graph. The mean,
however, can be taken in at least two ways. Using angles, the mean
slope is 2.41, while using cotangents leads to 2.0. Shown as dotted and
dashed lines in Figure 21.4, both look rather shallow, compared to the
data cloud. Also, they would predict Y to drop to zero for X less than
0.25. We would have a kinky model. For central slope 2: Y=0 for
X<0.25; Y=-0.50+2X for 0.25<X<0.75; and Y=1 for X>0.75. (Calculation
of slopes: The mean angle between the lines is (90°+45°)/2 = 67.5°, and tan 67.5° = 2.41.
The mean of cotangents, (0+1)/2 = 1/2, leads to slope 2.)
But take a different look at what might happen at X=0.25. Consider a
Radical elected in such a 25% conservative district. If she wants to
please only her radical voters, she’ll take a 0% conservative stand (i.e.,
Y=0), placing her at the Polarized line. But if she wishes to serve all her
constituents equally, her stand would be Y=0.25, at the Not-Polarized
line. The mean of these two extremes is (0+0.25)/2=0.125. It’s easy to
generalize that the Radical representatives would follow Y=X/2, on the
average. This is roughly what we observe, indeed, for Radicals in
Figure 21.5, at least up to X=0.4. The corresponding line for
Conservative representatives would be Y=0.5+X/2. This is roughly what
we observe for Conservatives in Figure 21.5, for X>0.60. The resulting
curve would consist of two lines at slopes 0.5 interrupted at X=0.5 by a
vertical jump – a second kinky model.
The first kinky model fits the data better in the center and the second
one at the margins of the graph. The two lines intersect around X=0.37
and X=0.63, as shown in Figure 21.5. [NB! In my Figure the central section is
drawn a bit steeper than it should be.] The combined model has less sharp kinks,
but it still has them.
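For readers who like to experiment, here is a small Python sketch of the two kinky models just described (my own restatement of the formulas in the text, not code from the original analysis):

```python
def kinky_polarization(x):
    # First kinky model: mean of the fully polarized step function and the
    # Y = X line, with the cotangent-based central slope 2.
    if x < 0.25:
        return 0.0
    if x > 0.75:
        return 1.0
    return -0.5 + 2 * x

def kinky_half_slope(x):
    # Second kinky model: a representative splits the difference between
    # pleasing hard-core partisans and serving all constituents.
    return x / 2 if x < 0.5 else 0.5 + x / 2

print(kinky_polarization(0.4), kinky_half_slope(0.4))  # 0.3 and 0.2 at X = 0.4
```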
Comparing the smooth and kinky models
In sum, we have two partly conceptually based models, both of which respect the 3 conceptual anchor points. First, we have Y = X^k/[X^k + (1-X)^k]. This is the simplest continuous curve. It involves no assumptions
of a political nature. Its slope k at X=0.5 is a freely adjustable parameter.
We do not know why the best fit with data occurs around k=3. Second,
we have a kinky model that starts from politically grounded upper and
lower values. Its predictions do not depend on any free parameter. Its
central slope is slightly shallower than the observed tendency.
Figure 21.6 compares the two models. The smooth model has the
advantage of continuity of slope. Nature rarely offers sharp kinks like
those in the second model. However, at very low and high X, the kinky
model agrees better with data. In the center it does less well than the
continuous model, but it is unfair to compare a model with no adjustable
parameters to one that has one. Of course a model with an adjustable
parameter can do better – but that still leaves the need to explain why
the parameter has the value it has. For a model with no adjustable
parameter, the kinky model does remarkably well – and it leaves no
unexplained features. [NB! In my Figure the central section of the kinky model is
drawn a bit steeper than it should be. This means that the fit to data is a bit less good
than it looks in the graph!]
Thus both models have advantages. Testing with further data (for
other US Congress periods and for other two-party parliaments) might
show how stable the central section of the pattern is. Variations in
parameter value of the continuous model might also cast some light on
what it depends on. The US House has lately been relatively highly
polarized. How would the pattern differ in earlier, less polarized times?
It should be kept in mind that the kinky model presumes clearly radical
and clearly conservative parties. This presumption breaks down for the
times the US Democrats had a large conservative southern wing.
Figure 21.6. Comparison of smooth and kinky models.
The envelopes of the data cloud
The envelopes of a data cloud are the curves that express this cloud’s
upper and lower limits. Such curves are proposed in Figure 21.7. The
area they enclose includes most data points, though not all. Like the
median curve, the upper and lower envelopes tend to follow a drawn-out S shape, starting at (0,0) and ending at (1,1). However, they do so
with a bias: They reach Y=0.50 around X=0.35 and X=0.65, respectively, meaning a bias of B=±0.15 compared to the unbiased X=0.50.
The simplest family of such biased curves is (Taagepera 2008: 109–110)
Y = X^(bk) / [X^(bk) + (1-X^b)^k].
[3 anchor points, bias exponent b]
This can also be expressed more symmetrically as
Y/(1-Y) = [X^b/(1-X^b)]^k.
Here the exponent b is connected to bias B as b = -log2/log(0.5+B). For
an unbiased system, b=1. For B=-0.15 and B=+0.15, b=0.66 and b=1.61, respectively.
For the envelope curves, we should keep the same value k=3 that we
used for the mean curve, unless the data strongly hints at something
else.
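As a quick numerical check (a sketch of my own, not part of the original analysis), the conversion from bias B to exponent b can be computed directly:

```python
import math

def bias_exponent(B):
    # b = -log 2 / log(0.5 + B); b = 1 when B = 0 (no bias).
    return -math.log(2) / math.log(0.5 + B)

print(round(bias_exponent(-0.15), 2))  # 0.66
print(round(bias_exponent(+0.15), 2))  # 1.61
```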
Figure 21.7. Envelope curves.
Figure 21.7 shows the resulting envelope curves, along with the average continuous curve:
Y = X^(3b) / [X^(3b) + (1-X^b)^3],
where b=1 for the average curve and b=0.66 and b=1.61, respectively,
for the envelope curves. Both the average trend and the limits of the
data cloud are expressed reasonably well. Fairly few data points remain
outside the area they delimit, and fairly few blank areas remain inside.
[Extending the bias range to B=±0.19 would make the area include almost all outliers.
Maybe it should be done during a revision.]
Why are representatives more extreme than their districts?
This example has both methodological and substantive purposes. On the
methodological side, it has shown that, compared to the published
analysis, much more could be extracted from data. The data cloud could
be described much more specifically than just “When x goes up, y goes
up” in a linear way. This more specific description offered a better
starting point for asking why the data followed this particular pattern.
In addition to offering a specific example of a symmetric drawn-out
S curve, it has also introduced biased drawn-out S curves. This brief
introduction will not enable one to calculate the parameters for such
curves, but one should be able to recognize the shapes of such data
clouds and ask for help when the need arises. The notion of envelope
curves has also been introduced.
Different logical approaches sometimes can be applied to the same
problem, and they may fit data about equally well. One still has to
ponder which one makes more sense. Each of the models presented here is preferable by one of the following criteria: 1) All other things being equal, a model with no adjustable parameters is preferable to one with such parameters. 2) All other things being equal, a smooth model is preferable to a kinky one. What do we mean by "making more sense"?
It’s a question of which model is more useful in the long run. Models
with a wider scope and more ability to connect with other models are
more useful. For this purpose, the smooth format offers more hope.
On the substantive side, our more detailed data analysis leads to a
logically based answer to the question “Why are representatives more
extreme than their districts?" They try to balance off serving their hard-core partisans and serving all their constituents. This assertion leads to a more specific follow-up question: "Why are they more extreme to the
degree that we observe, and not more or less so?” Here our kinky model
offers a quantitative answer that does not quite fit. The continuous
model can be made to fit better, thanks to its adjustable parameter, but
the specific value of this parameter remains to be explained. Maybe the
two approaches could be combined, somehow.
22. Centrist Voters, Leftist Elites: Bias Within
the Box
• "Always" start scales from zero, not from "1".
• Among competing models, the one that covers a wider range of phenomena carries it.
The next graph in Dalton (2006: 233), reproduced in Figure 22.1,
seems to represent an issue close to that in Figure 21.1. Instead of
representatives and districts, we have party elites and voters. Instead of
conservatism in the US, we have opinions on Left/Right scale in
Europe. At first glance, we may see a quite similar upward curve in both
figures. This time, the equality line has already been inserted in the
original figure, and like in Figure 21.2, the data cloud crosses this line.
But there are differences.
Figure 22.1. Party elites and voters: Original graph (Dalton 2006:233) –
data plus equality line.
First, take a quick look at Figure 22.1, turn away, and answer the two
following questions. If you wish to place yourself at the center of the
Left/Right scale, which score would you offer? Also: How many scale
intervals are there on the x scale?
If you said 10 scale intervals, look again and count them. There are 9
intervals, because the lowest possible rating is 1, not 0. If you placed
yourself at 5, you didn't declare yourself a pure centrist but ever-so-slightly Left-leaning, because the arithmetic mean of 1 and 10 is 5.5, not
5.0. One can suspect that many respondents may have placed themselves at 5 when they really meant 5.5. This would introduce some
distortion – and can we be sure that voters and elites would make this
mistake in equal proportions? The relationship between their opinions
may be a bit distorted.
We already met this confusing aspect of using scales 1 to 10 in
Chapter 18. We now have cause to repeat the advice: “Always” start
scales from zero, not from “1”. Still, “always” is in quotation marks,
because exceptions may occur.
In the present case, if we continued with the scale 1 to 10, we would
continuously be confused. We better convert immediately to scales that
run from 0 to 1: X=(x-1)/9 and Y=(y-1)/9 (cf. Exercise 18.2). These
scales are introduced in Figure 22.2. We further demarcate, of course,
the allowed region and two anchor points (0,0 and 1,1) and extend the
equality line to the right top corner.
Now we may notice a difference, compared to the previous example. It
is made even more evident in Figure 22.2 by drawing in the vertical and
horizontal lines passing through the center point (0.5,0.5). This center
point is outside the data cloud. The data are biased toward the right. It
means that, with 3 exceptions, party elites are left of their voters – and
this applies even to right-wing parties!
Instead of a symmetric curve, we have a biased one, like the lower
envelope curve in Figure 21.7. We can fit the data to the previous
equation that includes bias:
Y = X^(bk) / [X^(bk) + (1-X^b)^k].
[3 anchor points, bias exponent b]
The equation has two adjustable parameters, so we have to fit with two
points in between 0 and 1. Let us fit a central location in the data cloud
on the right, such as a spot below the point labeled CSU, and another on
the left, such as in between the points labeled PS and LAB. By trial and
error, I reached approximately b=1.73 and k=1.04. This k barely differs
from 1.00, in which case the equation would boil down simply to Y=X^b.
The corresponding curve is shown in Figure 22.2.
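To show how the fitted curve can be used, here is a small Python sketch. The parameter values b=1.73 and k=1.04 are the trial-and-error values above; the voter score of 6 is a made-up input, purely for illustration.

```python
def to_unit(x):
    # Convert a 1-to-10 survey score to the 0-to-1 scale: X = (x - 1)/9.
    return (x - 1) / 9

def biased_curve(X, b=1.73, k=1.04):
    # Y = X^(bk) / [X^(bk) + (1 - X^b)^k], the three-anchor format with bias.
    num = X ** (b * k)
    return num / (num + (1 - X ** b) ** k)

X = to_unit(6)                    # voters at 6 on the 1-10 scale, X ~ 0.56
print(round(biased_curve(X), 2))  # roughly 0.36: elites predicted left of their voters
```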
This is not the statistical best fit of Y = X^(bk)/[X^(bk)+(1-X^b)^k] to data. (In
retrospect, I would rather fit to the points CDU and LAB, so as to reduce bias and make
the curve steeper, reaching higher above the equality line.) But it is sufficiently
close to get the picture. Given the degree of scatter, we could be almost
as well off with a fit to Y=Xk – and it would be much simpler. Then why
don’t we do it? Because we would lose comparability with the previous
case – and comparisons are important for theory building.
Figure 22.2. Original graph (Dalton 2006:233) plus corrected scales and
a model including bias.
If we switched to a different model, we’d imply that the European and
the US situations are not comparable – and this would be the end of
comparative theorizing. If we stick to the same general format,
however, we imply that similar processes might be involved. Then we
can decompose the question into several. Why is it that the US picture
in Chapter 21 is unbiased while the European elites pull toward the Left
(b=1.00 vs. b=1.73)? Why is it that the swing at the crossing point is
high for US elites and almost nil for the European (k=3 vs. k=1.04)? Or
are these two features related? Is it a difference of countries or a
difference in the question asked (voting on issues vs. expressing
opinion)?
As further data of a roughly similar type are collected, we may be
able to answer such questions. In contrast, it would be a dead end if we
limited ourselves to the original graphs (Figures 21.1 and 22.1). We’d
simply observe that “y increases with increasing x” and that’s it. Not
only would we not have answers – we wouldn’t even have questions
begging for answers.
What about our kinky model in the previous chapter? It plainly fails
here, on the right side of the field. This illustrates a general point about competing models: the one that applies over a wider range of phenomena carries
it. Here neither model from the previous chapter works in a direct way.
But the continuous model can be adjusted, by introducing a bias
parameter. I see no way to adjust the kinky model.
* Exercise 22.1
These graphs have the same format as Figure 22.1: elite opinion vs.
voter opinion. They come from an earlier book by Russ Dalton (1988)
and include 14 parties in France, Germany and UK. Different from
Figure 22.1, they present opinions on specific issues. The first graph has
abortion (Dalton 1988: 215). In the second I have superimposed graphs
on three issues where the patterns look rather similar: nuclear energy;
further nationalization of industry; and aid to third world nations
(Dalton 1988: 214, 216, 217). Due to this superimposition, the labels
are blurred, and the same party label occurs three times.
Make an exact copy of the figure above. If you can't use a copy
machine, paste this graph on a well-lighted window, paste blank paper
on top of it, and trace carefully all the points.
a) What type of models could be used here?
b) Add the corresponding data fits to the graphs.
c) Compare to the envelope curves in Figure 21.7.
d) Do everything you can, approximately – no detailed data fitting
is expected.
e) Give some thought to why the elite-voter relationship
is different for abortion. How does this issue stand apart from
the 3 others?
23. Medians and Geometric Means
• Unlimited ranges go with normal distributions, arithmetic means, and linear relationships. Positive ranges go with lognormal distributions, geometric means, and fixed-exponent relationships. The logarithms of the latter are equivalent to the former.
• For two-peaked distributions, the overall median and means often make no sense.
• For single-peaked distributions, choose a mean close to the median. Geometric means are often more meaningful than arithmetic means when only positive values can occur.
• To calculate the arithmetic mean of n numbers, add them, then divide by n: (100+10+4)/3=38. For the geometric mean, multiply them: 100×10×4=4,000, then take the n-th root. On a pocket calculator, push key 'y^x', enter n, push key '1/x', push key '=' and get 4,000^(1/3)=15.87≈16.
• The geometric mean of numbers corresponds to the arithmetic mean of their logarithms.
• Whenever fitting with normal distribution yields a standard deviation larger than one-quarter of the mean, we should dump the normal fit and try a lognormal fit instead.
• When x cannot take negative values but some values drop to 0, then calculate the geometric mean, replacing these 0 values by values close to the smallest non-zero value.
Geometric means entered right in the first chapter – they could not be
avoided. Now it is time to compare the median and various means more
systematically. When talking about means or averages, we tend to
visualize a value such that half the items are smaller and half the items
are larger. This is the definition of the median. Yet, instead of medians,
we actually often deal with arithmetic or geometric means. Depending
on the nature of data, either one or the other mean approximates the
median. Why don’t we just use the median, if this is what we really are
after? The median itself is often awkward to handle, as we will soon
see. We should know which mean to use, instead. We should also
recognize situations where neither mean is satisfactory, and
sometimes even the median makes little sense.
What difference does it make, which mean is used? Suppose we are
told that the monthly mean income in a country is 10,000 euros. We
may now think that half the people earn more than 10,000, but actually
only about one-third do. How come? Because a few millionaires can
outweigh many poor people, when it comes to arithmetic mean. Indeed,
take three incomes, 1,000, 3,000 and 26,000 euros. Their arithmetic
mean (10,000) depends too much on the largest income. The geometric
mean (4,300) is closer to the median (3,000).
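A quick check of these numbers in Python (the three incomes are the hypothetical ones above):

```python
import statistics

incomes = [1_000, 3_000, 26_000]
print(statistics.mean(incomes))                   # 10000
print(round(statistics.geometric_mean(incomes)))  # about 4273, i.e. roughly 4,300
print(statistics.median(incomes))                 # 3000
```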
It matters very much whether a variable could take only positive
values or any values from minus to plus infinity (Chapter 14). This is so
for the means, too. When variables can range from minus to plus
infinity, only the arithmetic mean is possible. When variables are
restricted to positive values, then the geometric mean is the likeliest to
reflect the median.
A two-humped camel’s median hump is a valley
All this presumes a single-peaked distribution of data. When the frequency curve has two peaks, the overall median (and means) may
mislead us badly. The “median hump” of a two-humped camel lies in
the valley between the two humps. The worldwide distribution of
electoral district magnitudes consists of a sharp spike at M=1 for
countries that use single-member districts (SMD), followed by a
shallow peak around M=5 to 15 for countries that use multi-seat
districts. This distribution looks like a minaret and a shallow dome next
to it. Depending on the exact proportion of SMD countries, the median
could happen to be M=1, hiding the existence of multi-seat districts, or
around M=2 to 5, where few actual countries are located.
What should we do? If the distribution has two peaks, try to handle
each peak separately. In the following, we assume one-peaked
distributions. Then the broad advice is to pick a mean reasonably close
to the median.
Arithmetic mean and normal distribution
Suppose 35 items are distributed as in Table 23.1. (These are the frequencies
of measurements in Chapter 29, Extra 1.) Most cases are in the center. Most
important, the two wings are symmetric, as one can see when looking at
the number of cases for each size. The median is 80, and so is the
arithmetic mean (A). Recall Chapter 1: To calculate the arithmetic mean
of n numbers, we add them, then divide by n:
A = Σxi/n.
Here, A=(75+2×76+3×77+… +85)/35 = 2800/35 = 80.
Table 23.1. Hypothetical sizes of 35 items.
__________________________________________________________
Size             75  76  77  78  79  80  81  82  83  84  85
Number of cases   1   2   3   4   5   5   5   4   3   2   1
When the distribution is symmetric, the arithmetic mean yields the
median. One particular symmetric distribution that occurs frequently is
the bell-shaped normal distribution, roughly shown in Figure 23.1.
What's so "normal" about it? This is a prime example of an ignorance-based model. When the items involved can in principle have any
positive or negative values, the very absence of any further knowledge
leads us to expect a distribution that fits a fairly complex equation – the
one that produces the normal curve. (This equation is not shown here.)
Normal distribution has long symmetric tails in both directions. Its
peak (the “mode”) is also the median and the arithmetic mean. Normal
distribution is fully characterized by two constants, the arithmetic mean
and a typical width, standard deviation (σ, sigma). Each tail that goes
beyond one standard deviation includes only 1/(2e)=1/(2×2.718)=18.4%
of all the cases. The form of the curve is prescribed by our ignorance,
but the location of the mean and the width σ of the peak are not.
Figure 23.1. Normal and lognormal distributions. For normal
distribution, standard deviation σ characterizes the width of the peak.
[Figure: left curve, a symmetric normal distribution of width σ with median = arithmetic mean; right curve, a right-skewed lognormal distribution with median = geometric mean.]
Beyond two standard deviations, the normal distribution falls to extremely low values. However – and this is important – it falls to utter zero
only at plus and minus infinity. Thus, in principle, normal distribution
does not apply to quantities that cannot go negative. When applied to
heights of people, normal distribution would suggest that, in extremely
rare cases, people with negative heights would occur.
True, if the mean is far away from zero and standard deviation is
much less than the mean, then normal distribution still works out as a
pretty good approximation. Suppose that women’s mean height is 160
cm and the standard deviation is 10 cm. Then the zero point is so far
away (10 standard deviations) that it might as well be at minus infinity.
However, suppose it is reported that the mean number of telephones
per 1000 people in various countries is 60, with standard deviation 70.
(I have seen such reports in print!) This is what Figure 23.1 shows. It
would mean that more than 18% of the countries have negative numbers
of telephones! Obviously, the actual distribution is not normal, and the
normal model must not be applied. Our models must not predict
absurdities.
In sum, for normal distribution the arithmetic mean equals the
median. Strictly taken, normal distribution applies only to quantities
that can range from plus to minus infinity. We can apply it to quantities
that cannot go negative, if standard deviation turns out to be much less
than the mean. When, to the contrary, standard deviation exceeds one
quarter of the mean, normal distribution must not be used. What do we
do instead? Try lognormal distribution.
Geometric mean and lognormal distribution
To calculate the geometric mean of n numbers, we multiply them, then take the n-th
root:
G = (Πxi)^(1/n).
Like Σ (capital sigma) stands for summation of terms xi, here Π (capital pi) stands for product of terms. For 100, 10 and 4, the product is 100×10×4 = 4,000. On a pocket calculator, push key 'y^x', enter n, push key '1/x', push key '=', and get 4,000^(1/3) = 15.87 ≈ 16.
Table 23.2. Hypothetical sizes of 10 items.
x       1     2     5     7     10    11    13    20    50    100
log x   .00   .30   .70   .85   1.00  1.04  1.11  1.30  1.70  2.00
Suppose we have 10 items of sizes shown in Table 23.2. These might
be the weights of dogs in US pounds. The median is (10+11)/2=10.5.
The arithmetic mean is much larger – 21.9 – because it is much too
much influenced by the few largest components. But the geometric
mean is quite close to the median:
G = (Πxi)^(1/n) = (1×2×5×…×100)^(1/10) = 10,010,000,000^0.1 = 10.
Here the geometric mean reflects the median. This is so because we
have many small and few large entries. If we group the data by equal
size brackets 0 to 19.9, 20 to 39.9, etc., we get 7 – 1 – 1 – 0 – 0 - 1 items
per group – the distribution is far from symmetric, with peak at the low
end.
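The same check can be run on the Table 23.2 data with a few lines of Python:

```python
import statistics

weights = [1, 2, 5, 7, 10, 11, 13, 20, 50, 100]      # hypothetical dog weights
print(statistics.median(weights))                     # 10.5
print(round(statistics.mean(weights), 1))             # 21.9
print(round(statistics.geometric_mean(weights), 1))   # about 10.0
```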
The geometric mean corresponds to the median when the distribution is lognormal. This is a distribution where all entries are positive,
and it has a long drawn out tail toward the large values, as shown in
Figure 23.1.
Why is this distribution called lognormal? Because the distribution
of the logarithms of such data is normal. Take the logarithms of the 10
items above, as also shown in Table 23.2. Group these logarithms by
equal size brackets 0 to 0.49, 0.5 to 0.99, etc., and we get 2 – 2 – 4 – 1 –
1. This is more balanced, with a peak at the center. (By juggling the data
a bit, I could get a perfectly symmetric distribution, but this would be
somewhat misleading. With so few data points, we obtain only
approximations to smooth distributions.) What is the meaning of going
by brackets 0 to 0.49, 0.5 to 0.99, etc.? We effectively divide data into
multiplicative slots: going from 1 to 3.16, then from 3.16 to 10, then
from 10 to 31.6, and so on.
The geometric mean has a similar relationship to the arithmetic
mean: The geometric mean of numbers corresponds to the arithmetic mean of their logarithms. Indeed, G = (Πxi)^(1/n) leads to
logG = (1/n)log(Πxi) = (1/n)log(x1·x2·x3·…) = (1/n)(logx1 + logx2 + logx3 + …) = Σ(logxi)/n.
When our pocket calculator does not have a "y^x" key but has "logx" and its reverse, "10^x", then we can take the logarithms of all numbers and
calculate their arithmetic mean, which is logG. Then put 10 to the
power logG, to get G.
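The same detour through logarithms is easy to mimic in code (a sketch, using the 100, 10, 4 example from above):

```python
import math

values = [100, 10, 4]

# The geometric mean equals 10 raised to the arithmetic mean of the base-10 logs.
logG = sum(math.log10(v) for v in values) / len(values)
print(round(10 ** logG, 2))  # 15.87, i.e. about 16
```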
Note in Figure 23.1 that the geometric mean for the lognormal
distribution is NOT at the mode (the peak) – it is further right. It has to
be, because the areas under the curve, left and right of the median, are
equal by definition. The arithmetic mean is even further to the right.
The median is harder to handle than the means
When talking about means or averages, we most often are really
interested in the median. When the distribution is normal, the median
equals the arithmetic mean – this was the case for numbers in Table
23.1. When the distribution is lognormal, the median equals the
geometric mean – this was approximately the case for numbers in Table
23.2.
But if we are really interested in the median, then why don’t we just
calculate the median? The median is often awkward to handle, for the
following reasons. For arithmetic and geometric means, we just add or
multiply the numbers in any random order. This is easily done on a
pocket calculator. For median, we would have to write them out,
arranged by size. This is a minor hassle, and of course, computers can
handle it easily.
But suppose someone else had a data set like the one in Table 23.2.
She reports that for these 10 items, A=21.9, G=10.0, median =10.5.
Much later, we locate two other items that fit in, with values 6 and 8.
They clearly reduce the median and the means, but by how much? What
are the new values?
For the arithmetic mean, it’s simple. We restore the previous total
(10×21.9=219) and add the new items. The new arithmetic mean is A'=(219+6+8)/(10+2)=233/12=19.4. Similarly, the new geometric mean is G'=(10.0^10×6×8)^(1/(10+2))=(4.8×10^11)^(1/12)=9.4. For the median, in
contrast, we are stuck. Having its previous value does not help us at all.
We’d have to start all over, arranging the items by size – but we don’t
have the original data! Even if we can locate the author, she may have
already discarded the data.
This is why medians are harder to handle than means. We can build
on previous means, but with medians we may have to start from scratch.
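The updating trick for means can be written out directly (a sketch using the numbers from this example):

```python
# Summary values reported earlier, without access to the raw data.
n, A, G = 10, 21.9, 10.0
new_items = [6, 8]

A_new = (n * A + sum(new_items)) / (n + len(new_items))   # about 19.4

product = G ** n
for v in new_items:
    product *= v
G_new = product ** (1 / (n + len(new_items)))             # about 9.4

print(round(A_new, 1), round(G_new, 1))
```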
The sticky case of almost lognormal distributions
Recall the situation where the mean number of telephones per 1000
people in various countries is 60, with standard deviation 70. The
distribution clearly cannot be normal. We should try lognormal
distribution and calculate the geometric mean rather than the arithmetic.
But suppose there is one country without a single telephone. Single-handedly, it sinks the geometric mean to zero! Take the data in Table
23.3. The arithmetic mean clearly exceeds the median (0.9), but G=0
would under-represent it outrageously. What should we do? I propose
the following.
When the numbers to be averaged are a1=0 < a2 < a3 < a4 < …, replace a1=0 by a number a1' such that a1'/a2 = a2/a3. This means
a1' = a2²/a3.
In the present case a1' = 0.2²/0.7 = 0.06, and G=0.75, reasonably close to the median.
Table 23.3. Hypothetical sizes of 6 items, one of which is 0.
ai:  0   0.2   0.7   1.1   2.0   10.0          Median = 0.9     A = 2.33
G including the 0 as it stands:                                   G = 0
Calculating G for non-0 items only counts 0 as if it were 1.252:  G = 1.252
Replacing 0 with 1 counts 0 at more than median value:            G = 1.206
Replacing 0 with 0.2²/0.7 = 0.06 yields G close to median:        G = 0.755
Why do I propose this? Using equal ratios means we assume that the
values at the low end of the lognormal distribution increase roughly
exponentially. What other choices do we have?
Some computer programs sneakily omit the 0, without telling you so,
calculate the geometric mean of the 5 non-zero items, and report
G=1.252. By so doing they effectively replace the “0” by 1.252, a
number that in this case is larger than the median! (Test it: Replace 0 by
1.252, take the geometric mean of the 6 numbers, and see what you get.)
This patently makes no sense. Some other programs multiply together
the 5 non-zero items but then take the 6th root rather than the 5th, and
report G=1.206. By so doing they effectively replace the “0” by 1,
which is again a number larger than the median. (Test it!) This makes
no sense either. We should not replace 0 by something that exceeds the
smallest non-zero item.
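The proposed zero-replacement rule is easy to state as a small function (a sketch of the rule in this section, not a general-purpose routine):

```python
def geometric_mean_with_zero(values):
    # Replace a single zero by a1' = a2**2 / a3, so that a1'/a2 = a2/a3,
    # then take the ordinary geometric mean.
    vals = sorted(values)
    if vals[0] == 0:
        vals[0] = vals[1] ** 2 / vals[2]
    product = 1.0
    for v in vals:
        product *= v
    return product ** (1 / len(vals))

print(round(geometric_mean_with_zero([0, 0.2, 0.7, 1.1, 2.0, 10.0]), 2))  # about 0.75
```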
How conceptual ranges, means, and forms of relationships are connected
Unlimited ranges go with normal distributions, arithmetic means,
and linear relationships.
Positive ranges go with lognormal distributions, geometric means,
and fixed-exponent relationships.
The logarithms of the latter are equivalent to the former.
This is quite a mouthful. What does it mean? Chapters 14 and 23 are interrelated. The conceptually allowed range matters in the following way.
When any values are possible, from minus to plus infinity, and things are
randomly distributed, they are likely to follow a “normal” pattern around the
arithmetic mean. Its equation (not shown here) is established on sheer probabilistic grounds, based on ignorance of anything else. Also, when two
variables can range from minus to plus infinity, their relationship may well be
linear.
When only positive values are conceptually possible, this constraint
“squeezes” the negative side of the normal distribution to zero, and lognormal
distribution results, with median at the geometric mean. Also, when two
variables are constrained to only positive values, it “squeezes” their simplest
relationship from linear to fixed exponent.
This reasoning works even more clearly in the opposite direction. Take the
logarithms of 1 and of 0. Log1 is pulled down to 0, while log0 is pulled way
down, to minus infinity. We are now back to the unbounded range, where the
logarithms of values from 0 to 1 fill the entire negative side. This is how the
lopsided lognormal distributions of x, such as people’s incomes, are pulled out
into nicely symmetric normal distributions of logx. The geometric mean of x
then corresponds to the arithmetic mean of logx. And the curve y=Ax^k, limited to x>0 and y>0, is unbent into the straight line logy = logA + k·logx, which can
take negative values.
When we notice these broad correspondences, many seemingly arbitrary
and unconnected features start making sense in a unified way. Physical and
social relationships respect the resulting patterns, not because of some
conscious decision to do so but precisely because these patterns describe what
happens when no conscious decision is taken. Randomness has its own rules.
24. Fermi’s Piano Tuners: “Exact” Science and
Approximations
• Exact science means striving to be as exact as possible but no more exact than needed. This avoids getting stuck in details.
• Estimates that seem impossible at first glance can be decomposed into a sequence of simpler estimates that can be answered approximately.
• One should develop a sense for typical sizes of things, which could then be fed into such approximations. This is part of basic numeracy.
• Dimensional consistency is needed – and it is also helpful in calculations.
We have repeatedly used approximations, and you may feel uneasy
about it. Aren’t natural sciences “exact sciences” and shouldn’t social
sciences try to become more of an exact science? Well, exact science
does not mean that every result is given with four decimals. It means
being as exact as possible at the given stage of research. This makes it
possible to be more exact in the future.
As exactly as possible – and as needed
Nothing would stifle advance of social science more than advice to give
up on quantitative approaches, just because our first measurements
involve a wide range of fluctuation or our conceptual models do not
quite agree with the measurements. A three-decimal precision will
never be reached, if one refuses to work out problems approximately, at
first.
But even “as exact as possible” must be qualified. In later applications, there is little point in using more precision than is needed for the
given purpose. And while building a logical model, one might initially
apply “no more exact than needed”. What does it mean? Take volatility.
We restricted ourselves to a single input variable – the number of
parties. Even here we at first built a linear model that ignored a glaring
constraint: volatility cannot surpass 100 per cent. We simplified and
aimed at an approximation that might work at the usual levels of
volatility, much below 100 per cent. We got a fair fit to this coarse
24. Fermi’s Piano Tuners: “Exact” Science and Approximations
model (Figure 11.2). The refined exponential model would be needed
only if we met a huge effective number of parties. If we had tackled the
refined model first, we might have been lost in needless complexities.
How many piano tuners?
Approximations are not something social scientists are reduced to. The
ability to approximate is useful in physics too, as the great physicist
Enrico Fermi kept stressing. (How great was Fermi? Great enough to
have the 100th element called Fermium.) His favorite question to new
students was: “How many piano tuners are there in the city of New
York?”
This takes longer than “How many parties might win seats”. Fermi’s
students’ first reaction most likely was that they could not possibly
know. It wasn’t even a question about physics. Fermi’s intention was
two-fold. First, show that questions can be decomposed into a sequence
of simpler questions that can be answered approximately. Second,
develop in students a sense for typical sizes of things, which could then
be fed into such approximations.
For piano tuners, our first question would be: How many people are
there in the city of New York? Forget about how one defines this city,
within a larger megapolis. Forget about decimals. An estimate of 10±2
million is close enough. However, we must have some general sense of
how large cities are. It helps to have one city’s population memorized.
If you have memorized that Chicago has about 5 million or that
Stockholm has about 1 million, you can peg other cities to that knowledge. Would Budapest be larger or smaller than Chicago or Stockholm? Knowing that Hungary has 10 million people would help.
Many cities in the developing world are presently growing with such
speed that my estimates might fall behind the times. Also, since Fermi’s
times (say, 1960), pianos have given way to electronic noisemakers. So
let’s specify: How many piano tuners might there have been in the city
of New York around 1960? The sequence of questions might be the
following.
• How many people in New York?
• How many households in New York? We would have to ask first: How many people per average household?
• What share of households might have a piano? And compared to the number of household pianos, how many pianos elsewhere (concert halls etc.)?
• How often should a piano be tuned? And how often is the average piano tuned?
• How long does it take to tune a piano?
• So what is the total workload for piano tuners, in hours, during one year?
• At 40 work hours per week, how many piano tuners would New York keep busy?
Exercise 24.1
How many piano tuners might there have been in the city of New York
around 1960?
a) Offer your estimate for each step.
b) Calculate the resulting number of piano tuners.
c) A student of mine looked up the US census around 1975.
Believe it or not – it did have the number of piano tuners in
New York: about 400. By what percentage or by what
factor was your estimate off?
The range of possible error
How good is our guess for the number of piano tuners? How much off are
we likely to be? Let us look realistically at the possible errors at each
step.
• How many people in New York? We might be off by a factor of 2: relative error ×÷2, meaning "multiply or divide by 2".
• How many households in New York? Once we ask this, we discover that we first have to ask: How many people per average household? It could again be ×÷2.
• What share of households might have a piano? This is where students in the 1960s and 1970s were liable to be far off, depending on whether their own parents did or didn't have a piano – and we have to think back into no-computers surroundings. Also, compared to the number of household pianos, how many pianos elsewhere (concert halls etc.)? This adds to the uncertainty. We might be off by ×÷4.
• How often does a piano get tuned? We may have no idea. Phoning a piano tuner would give us only a lower limit – how often a piano should be tuned. But how often is the average piano tuned in reality? I know one that hasn't been tuned for 20 years. We might be off by ×÷3.
• How long does it take to tune a piano? We might be off by ×÷3, if we have not seen one at work.
• So what is the total workload for piano tuners, in hours, during one year? We have a long string of multiplication but no new error, unless we make a computation error.
• Assuming a 40-hour workweek, how many piano tuners would it keep busy? No new error here either, unless we make a computation error.
By what factor are we likely to be off on the number of piano tuners?
Suppose we overestimate every factor we multiply and underestimate
every factor we divide by. By the error estimates above, we could then
overestimate by a factor of
2×2×4×3×3=144≈150.
Instead of 400 tuners, we could have found 150×400= 60,000 tuners. If
we underestimated to the same degree, we could propose 400/150=
2.7≈3 tuners. An estimate that could range from 3 to 60,000 – this was
not the point Fermi was trying to make. When you did the exercise
above, you probably got much closer, most likely within ×÷10, meaning
40 to 4000 piano tuners. Why?
You were most likely to err in random directions. At some steps
your error boosted the number of tuners, and at some other step it
reduced it. If you were very lucky, you might have horrendous errors at
each step, yet end up with the actual value, if your errors cancelled out
perfectly.
What is the most likely error range on multiplicative sequences of
estimates? My educated guess is: Take the square root of the maximal
combined error. For piano tuners, 144^(1/2) = 12. Indeed, being off by ×÷12
fits with my average experience with untutored students. However, the
combined error cannot be lower than its largest single component. If we combine ×÷2 and ×÷4 to ×÷8, then the actual likely error is ×÷4, even though 8^(1/2) = 2.8.
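The rule of thumb can be spelled out in a couple of lines (a sketch of the heuristic just described):

```python
import math

step_errors = [2, 2, 4, 3, 3]       # "multiply or divide by" factor at each step
worst = math.prod(step_errors)       # 144: every error pushing the same way
likely = max(math.sqrt(worst), max(step_errors))  # square-root rule, floored at the largest single error
print(worst, round(likely))          # 144 12
```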
Dimensional consistency
We briefly encountered the dimensional consistency requirement in Chapter
6. It looked like another formality. In the present problem, it actually
becomes extremely useful. But if it mystifies you, leave it for later – it’s
not essential, just helpful.
We started with the city population. The larger it is, the more piano
tuners there must be. But so as to find the number of households, should
we multiply or divide by the number of people in a single household?
We may use our good sense. If we multiply, we get more households
than people, so I guess we must divide. And so on.
But introduce units, or quasi-units. City population is not just so
many million – it’s so many million persons. What is the unit for the
number of persons in the household? The unit is persons/household.
Now, if by mistake we multiply the population and the number of
persons per household, the units also multiply, resulting in
[x persons] × [y persons/household] = xy [persons]²/household.
Persons squared? No, this is not what we want. So let us try dividing:
[x persons]/[y persons/household] = (x/y) households.
By the usual rules of arithmetic, persons/persons cancel out, and
1/[1/household]=household. Yes, this is the unit we want to have. So
division must be the right way to combine the population and the
number of persons per household.
Overall, we want a sequence where, after the rest cancels out, only
the unit “tuners” remains. I’ll drop the numbers (such as x and y above)
and show only units. Go by three steps. The first is for the total number
of pianos:
[persons] × [pianos/household] / ([persons/household] × [household pianos/all pianos]) = pianos.
Check that, indeed, everything else cancels out, leaving pianos. The
second step is for the total work hours/year needed:
[pianos] × [tunings/(year×piano)] × [work hours/tuning] = work hours/year.
The third step is for the number of piano tuners:
[work hours/year] / ([work hours/(tuner×week)] × [weeks/year]) = tuners.
Check that everything else cancels out, leaving tuners. We could do the
whole operation in one mammoth step:
[persons][pianos/household][tunings/(year×piano)][work hours/tuning] / ([persons/household][household pianos/all pianos][work hours/(tuner×week)][weeks/year]) = tuners.
Here we just have to insert the numbers in their proper places (multiplying or dividing). But maybe such a long single sequence is too much
of a mouthful.
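For those who prefer to see the whole chain at once, here is a Python sketch with made-up round-number estimates (my own guesses, not Fermi's figures and not the census value):

```python
people                = 10_000_000  # persons in New York, around 1960
persons_per_hh        = 3           # persons/household
pianos_per_hh         = 0.1         # pianos/household
household_piano_share = 0.9         # household pianos / all pianos
tunings_per_year      = 1           # tunings/(year x piano)
hours_per_tuning      = 2           # work hours/tuning
hours_per_week        = 40          # work hours/(tuner x week)
weeks_per_year        = 50          # weeks/year

pianos   = people / persons_per_hh * pianos_per_hh / household_piano_share
workload = pianos * tunings_per_year * hours_per_tuning        # work hours/year
tuners   = workload / (hours_per_week * weeks_per_year)
print(round(tuners))   # a few hundred tuners with these particular guesses
```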
25. Examples of Models across Social Sciences
__________________________________________________________
• Sociology offers examples where the number of communication channels can be used.
• Political history offers examples where reasoning by extreme values and differential equations, especially exponential, come in handy.
• In demography, differential equations, exponential and more complex, are useful.
• Economics offers examples for using allowed areas and anchor points, plus differential equations.
Most examples in this book referred to political institutions. This is so
because I found the simplest examples in that subfield. Do the models
presented apply beyond political science? A student once asked me. Here
are some examples from my own work.
Sociology: How many journals for speakers of a language?
How would the total circulation of journals (J) published in a language increase as the number of speakers of that language increases? Language means communication. As population (P) increases, the number of communication channels (c) increases in proportion to P squared. Journals might expand at a similar rate, because with more speakers more specialized journals can be introduced.
Thus we might test the model J=kP². Circulation per capita, j=J/P, would then be j=kP. Per capita circulation is proportional to population. Graphing j against P on log-log scale, the slope must be 1, if the model fits.
Such a test needs a group of languages under rather similar
conditions, so as to eliminate a multitude of other factors. I found such a
group in the form of languages belonging to the Finno-Ugric language
family and spoken in the Soviet Union, where the authoritarian regime
made the conditions highly uniform. Circulation was weighted, giving
as much weight to a thick monthly magazine as to a daily newspaper.
Graphing on log-log scale does indeed lead to slope 1 for nations with
more than 10,000 speakers (Figure 25.1). For smaller populations,
figures become erratic. The best fit is around j = 2×10^(-6)·P (Taagepera 1999: 401–402).
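A one-line check of this reported fit (my own sketch) reproduces the range shown in Figure 25.1:

```python
def per_capita_circulation(P, k=2e-6):
    # Best-fit line j = 2e-6 * P on the log-log plot.
    return k * P

for P in (10_000, 100_000, 1_000_000):
    print(P, round(per_capita_circulation(P), 2))   # 0.02, 0.2, 2
```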
This is an example of how the basic notion of the number of communication channels applies in sociology.
Figure 25.1. Weighted per capita circulation of journals vs. number of
speakers of Finno-Ugric languages. Data from Taagepera (1999: 401–
402).
[Log-log graph: weighted per capita circulation, 0.02 to 2, against population, 10,000 to 1,000,000.]
Political history: Growth of empires
As history unfolds, more advanced technology enables larger polities
(political entities) to form, so that the total number of polities decreases.
What is the pattern of this decrease? First, we have to define the number
of polities, given that some are large while others are tiny. The
aforementioned effective number of components (Chapter 2) comes in handy. Should we consider size by area or by population? Let us do
both. But the two are interrelated, because large polities do not form in
empty space – they tend to form where population is the densest.
Indeed, it can be shown that the effective number of polities by area can
be expected to be the square of their number by population: N_A = N_P².
This follows from consideration of geometric means of extremes.
As we graph logN against time, over 5,000 years (Figure 25.2), we
find that the broad pattern is linear, which means that N decreases
exponentially with time. This is not surprising, because exponential
change is the simplest pattern around. But the two lines are interconnected in two ways.
First, they both must reach N=1 at the same time, as this would correspond to a single polity encompassing the entire world. The
statistical best-fit lines (OLS logN vs. t) cross around N=3, but it would
take only a minor adjustment to make them cross at N=1. This would
happen around year 4,000. So don't expect (or be afraid of) a single world empire any time soon.
Second, if N_A = N_P², then the slope for the area-based line must be double the slope for the population-based line. This is pretty much so,
indeed. But why does the rate constant k (the slope in Figure 25.2) have
this particular value? Why isn’t the slope shallower or steeper? We
don’t know as yet.
This is an example of how the exponential model and the use of geometric means of extremes apply in political history.
Figure 25.2. Decrease of effective number of polities over time
(Taagepera 1997), based on area (N_A) and population (N_P). Since N is
graphed on log scale, the exponential curve becomes a straight line.
Demography: World population growth over one million years
Differential equations are widely used in demography. Growth of
populations often follows the exponential pattern dP/dt=kP. However,
growth of world population from CE 400 to 1900 followed an even
steeper pattern: P = A/(D-t)^m, where A, D and m are constants. How do we
determine D? By trial and error.
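A sketch of the quasi-hyperbolic form, with placeholder constants chosen only to show the behavior (not fitted values):

```python
def quasi_hyperbolic(t, A=1.0, D=2000, m=1.0):
    # P = A/(D - t)^m grows ever faster and blows up as t approaches year D.
    return A / (D - t) ** m

for year in (1000, 1900, 1990):
    print(year, quasi_hyperbolic(year))
```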
The nasty thing about such a “quasi-hyperbolic” growth is that,
when time reaches t=D, population would tend toward infinity. World
population began to veer off this model by 1900 and by 1970 clearly
began to slow down. How can we modify quasi-hyperbolic growth so as
to include this slowdown? The model involves population interacting
with technology and the Earth’s carrying capacity (Taagepera 2014). It
projects toward a population ceiling at 10 billion. The corresponding
graph in Figure 25.3 has not yet been published. It shows two separate
phases and a puzzling kink around CE 200.
This is an example of how the exponential and more complex rate
equations apply in demography.
Figure 25.3. Human population during the last one million years
graphed on logarithmic scale against time prior to 2200, also on logarithmic scale. Vertical error flags show the ranges of estimates.
Economics: Trade/GDP ratio
Some countries export most of what they produce; so they have a high
Exports/GDP ratio and a correspondingly high Imports/GDP ratio.
(GDP means Gross Domestic Product.) Some other countries have little
foreign trade, compared to their GDP. Could this ratio depend on
country size? Let us carry out a thought experiment.
If a country came to include the entire inhabited world, what would
be its Trade/GDP ratio? It must be 0, since this country has no one to
trade with. So we have an anchor point: P = Pworld → Exports/GDP = Imports/GDP = 0.
At the opposite extreme, if a country consisted of a single person,
what would be its Trade/GDP ratio? It would be 1, because all this
person’s monetary transactions would be with people outside her own
country. So we have another anchor point: P = 1 → Exports/GDP = Imports/GDP = 1.
It is now time to draw a graph of forbidden areas and anchor points
(Figure 25.4). Visibly, P can range only from 1 to Pworld, and Exports/
GDP can range from 0 to 1 (or 100%). Given the huge difference
between 1 and Pworld, we better graph P on logarithmic scale.
Figure 25.4. Exports/GDP graphed against population. In 1970,
Pworld ≈ 4×10^9.
[Graph: Exports/GDP, from 0 to 1, against population on a log scale from 1 to Pworld; it shows the straight line Exports/GDP = 1 - logP/logPworld and a curve labeled "Actual, roughly".]
How can we move from the top-left anchor point to the one at lower
right? The simplest way would be a straight line, and indeed, a tentative
entropy-based model (Taagepera 1976) would lead to just that: A
fraction logP/logPworld of GDP is not exported, and hence Exports/
GDP=1-logP/logPworld. Trouble is, actual data do not fit. Countries with
populations below one million tend to have much higher trade ratios
than expected, while countries with populations above 10 million tend
to have lower trade ratios than expected. The curve shown in the graph
very roughly indicates the actual trend.
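The tentative entropy-based line is simple enough to evaluate directly (a sketch; recall that the actual data deviate from it in the way just described):

```python
import math

P_WORLD = 4e9   # world population around 1970, as in Figure 25.4

def exports_to_gdp_entropy(P):
    # Tentative model: Exports/GDP = 1 - logP / logP_world.
    return 1 - math.log10(P) / math.log10(P_WORLD)

print(exports_to_gdp_entropy(1))              # 1.0: a one-person "country" exports everything
print(round(exports_to_gdp_entropy(1e6), 2))  # about 0.38
print(exports_to_gdp_entropy(P_WORLD))        # 0.0: the whole world has no one to trade with
```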
From this point on, some skills are used which are beyond the scope
of this book. The main thing is to note that it all starts with Figure 25.4:
allowed area and anchor points.
Figure 25.5. Empirical approximations for dependence of Imports/GDP
and Exports/GDP on population (Taagepera and Hayes 1977).
Taagepera (1976) started with the classic differential equation for
absorption, in physics: dI/dr=-kI. Here I is flow intensity, r is distance
from source (of neutrons in physics, or of goods in economy), and
constant k reflects how rapidly the stuff produced is absorbed. This
means simple exponential decrease: I = I0·e^(-kr), where I0 is the intensity at
the source, such as a factory producing goods. The flow that reaches the
country border is counted as export. There are many sources (factories)
spread across the country, so that the equations become more complex.
In one dimension, they can be solved. Unfortunately, countries are two-dimensional, making it even more complex. An approximate solution
can be worked out, and it fits the data cloud.
This model makes outrageous simplifying assumptions. It is
assumed that an infinite flat world is uniformly populated. Thus a finite
world population cannot enter. All goods are assumed to be absorbed at
the same rate over distance, be it milk or oil. The model can be tested
using either country area or population. Not surprisingly, population
yields a better fit – people absorb goods, not space.
Despite such simplifications, this model still explains how trade
depends on population, on the average. For practical purposes, it was
found simpler to replace the complex model by empirical approximations (Taagepera and Hayes 1977):
Imports/GDP = 40/P^(1/3).
Exports/GDP = 30/P^(1/3).
How could imports exceed exports? Countries also pay for imports by
revenue from shipping, tourism, exporting labor, etc. As seen in Figure
25.5, the Imports equation fits within a factor of 2, while scatter is much
wider for Exports, due to the variety of the other revenue items. Note
that Figure 25.4 has the trade ratio on regular scale and in fractions of 1,
while Figure 25.5 has it on logarithmic scale and in percent. My coauthor for this work, Jim Hayes, was an undergraduate student.
How do those approximations fit with the logical anchor points? At
populations equal to world population (around 4 billion in 1970), the
equations predict Imports/GDP =0.0006=0.06% and Exports/GDP=
0.04%, rather than the conceptually required 0. For a logical model, any
non-zero value would be unacceptable, but for an approximation, this is
quite close – as long as we keep in mind that this is an approximation.
At the other extreme, Imports/GDP would reach 1 when P=6,400 rather
than 1, and Exports/GDP would do so at 2,700 persons. This is far from
the anchor point of 1 person.
The detailed model and its approximations in Figure 25.5 could be
part of an S-shaped curve (see Chapters 21 and 22) which joins the
anchor points and goes steeply down around P = 1 million (10^6). Indeed, we could get an imperfect fit of data to logical anchor points by using the equation used in Figure 21.7, which corresponds to adding a third central anchor point: Y/(1-Y) = [X/(1-X)]^k.
Figure 25.6 is similar to Figure 25.4, except that the x-scale has logP divided by logPworld, so that logically allowed values range from 0 to 1. We could get an imperfect fit of data to logical anchor points by using an equation similar to the one introduced in Figure 9.4: (1-Y)/Y = [X/(1-X)]^k. This symmetric "drawn-out S" curve drops too soon, compared to the data cloud. A better fit is obtained by adding a "bias exponent" b (cf. Chapter 21): (1-Y)/Y = [X^b/(1-X^b)]^k. This shifts the bending point away from (0.5, 0.5).
Why would the Exports/GDP ratio take such a path between the two anchor points, rather than a simpler path such as Y=X^k? This remains to
be explained. Imagine a country with only 300 inhabitants, so that
X=0.25 in Figure 25.6. We can well imagine that it would still export
almost all of what it produces and import almost all of what it needs, in
line with the dashed part of the curve. On the other hand, consider a
country with 30 million people, so that X=0.75 in Figure 25.6. We can
visualize that it could easily produce much more than 75% of what it
needs, in line with the curve shown. Thus this curve makes sense. But
it’s a long way from what makes sense to a quantitatively predictive
logical model.
Figure 25.6. Exports/GDP graphed against normalized population.
[Graph: Y = Exports/GDP, from 0 to 1, against X = logP/logPworld, from 0 to 1; it shows the straight line Exports/GDP = 1 - logP/logPworld = 1 - X, the biased curve (1-Y)/Y = [X^b/(1-X^b)]^k "very roughly", and a curve labeled "Actual, roughly".]
Like in Chapter 21, we observe here several competing and partly
contradictory approaches. They offer examples of how allowed areas,
anchor points, and differential equations more complex than the
exponential can apply in economics.
26. Comparing Models
• One can establish logical relationships among factors without worrying too much about causality, ability to indicate a specific process or mechanism, and distinctions between models and formats, and between general and substantive models.
• Broad models based on near-total ignorance apply in all sciences. Normal distribution and exponential change are prime examples.
• Constraints on the range of values of factors can determine the simplest forms of relationships between them. Call these relationships logical models or logical formats – the way to use them does not change.
• Substantive models are specific to a field of inquiry. They are bound to operate within the framework of broader logical formats. They may show up when data deviate from the simple ignorance-based expectations.
• Many models are "as if": Complexly integrated processes behave (produce outcomes) as if they could be broken up into simple components.
• Development of truly substantive social models, which introduce specifically social processes or mechanisms, depends on interlocking of fairly universal partial models: Connections among connections.
I feel uneasy about this concluding chapter. If you feel it helps you, use
it. If it doesn’t, forget it. This comparative perspective may be of use to
those doctoral students who have already pondered broad methodology.
The hands-on skills in model building can be attained without it.
This attempt to systematize some aspects of some models should not
fool us into thinking that this is the entire range of model building.
When faced with a possible relationship between two or more factors or
quantities, the nature of the problem should drive our thinking. Model
building should not be reduced to looking for the most suitable format
in an existing grab bag, even while we inevitably do give a chance to
those formats we already know best. One has to think, and the model
depends on the assumptions made. Whether these assumptions are
adequate can be evaluated by the degree of fit to data. (Note that I prefer
to use "factor" or "quantity" rather than the mathematical term "variable".)
An attempt at classifying quantitatively predictive logical models and
formats
Table 26.1 tries to classify those logical models and formats mentioned
in this book. This is of course only a small sample of basic forms, and
they do not fit neatly into a table. Note that these are quantitatively
predictive models, not merely directional ones. Note further that we do
not deal with linear data fits, which some people call “empirical
models”; those fits are not predictive but merely “postdictive” for the
specific data on which they are based.
Table 26.1. An attempt at classifying quantitatively predictive logical models and formats mentioned in this book.

Ignorance-based, 1 variable (static formats):
• Normal distribution
• Lognormal distribution
• Mean of limits, m<x<M → g = (mM)^(1/2) (several variables)

Ignorance-based, 2 variables (formats → rate equations):
• 2 quadrants allowed → Exponential: dS/dt = kS
• Zone in 2 quadrants → Simple logistic: dS/dt = kS(1-S/M)
• 1 quadrant allowed: y = Ax^k
• Zone in 1 quadrant, anchor and floor/ceiling → Exponential, preferably
• Box: l<x<L and m<y<M → 0<X<1 and 0<Y<1
• Box, 2 anchor points: Y = X^k
• Box, 3 anchor points
• Box, 3 anchors and bias

Semi-substantive (other processes/mechanisms):
• Communication channels → Journal circulation, Cabinet duration, Assembly size
• World population
• Trade/GDP
• Systematics of k in Y = X^k, in S = S0e^(kt), etc.

Substantive:
• Interconnecting models: CONNECTIONS AMONG CONNECTIONS
Some distinctions that may create more quandaries
than they help to solve
We may distinguish among logical models where a mechanism or
process links one factor to another, and logical formats that avoid
absurdities but otherwise remain black boxes: They work, but we do not
pin down the process that connects the factors.
We may also distinguish between deterministic relationships where a
given input leads to a precise output value, and probabilistic ones where
the output value is only the average output for a given input.
Furthermore, we may distinguish between general formats or models
that appear throughout natural and social sciences, and substantive
models that are specific to one field of study, such as sociology, economics or political science – or even a subfield.
And of course, we can wonder about causality. Which factor causes
which one? Or are they mutually affecting each other? Or are both
affected by a third factor?
Now we apply these notions to the exponential format, S = S0e^(kt), or dS/dt = kS in its differential form. Is it a model or a format, deterministic or probabilistic, general or substantive? It depends.
Microbial growth looks like a prime example of a substantive deterministic model. Imagine a microbe dividing once a day. The process is
repeated every day, as long as conditions (food, space, temperature,
etc.) remain the same. The successive numbers of microbes are 1, 2, 4,
8, 16, 32,... The process of each microbe dividing each day leads to
“Rate of growth is proportional to existing size”, dS/dt=kS. Here we do
have a mechanism: each microbe dividing. And it is deterministic, as
long as each microbe does its part. It further looks substantive,
specifically biological.
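The agreement between this mechanism and the exponential format is easy to check numerically. A minimal sketch (Python; the starting count of one microbe and the daily time step are assumptions for illustration, not data from the book): the daily doubling 1, 2, 4, 8, ... coincides with S = S0e^(kt) once k = ln 2 per day.

import math

S0 = 1           # assumed starting count: one microbe
k = math.log(2)  # relative growth rate per day implied by daily doubling

for day in range(6):
    doubled = S0 * 2 ** day               # mechanism: each microbe divides once a day
    exponential = S0 * math.exp(k * day)  # format: S = S0 * e^(k*t)
    print(day, doubled, round(exponential, 3))
# Both columns give 1, 2, 4, 8, 16, 32 -- the same outcome, described two ways.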
Here the biological dividing process does lead to the exponential equation. The reverse, however, is not always true. Not every exponential phenomenon results from a substantive deterministic process.
Radioactive decay does follow the same exponential format,
dS/dt=kS. (The negative value of k makes little difference.) “The rate of
decay is proportional to existing size”. But each atom does not do the
same thing, like each microbe does. Instead of each atom doing
something, they all have equal probability of suddenly decaying. A few
atoms are completely altered, while the others have not changed at all.
We cannot predict which atoms will decay next – in this sense the
process is probabilistic. But we can predict with high precision how
many atoms will decay during the next time period – this is deterministic.
For the same exponential format, bacterial growth involves a process
we can describe, but radioactive decay does not – nor does the decrease
in the effective number of sovereign political entities (Figure 25.2). As
the number of parties increases, interest group pluralism tends to
decrease exponentially, and so does electoral disproportionality
(Chapter 20). Are parties “causing” such decreases? One might as well
ask whether time is “causing” some radioactive nuclei to break apart.
We merely have interrelation, without one specific process, but there
are probably some logical reasons connected to underlying factors.
Is the exponential model deterministic or probabilistic, process-based or not, causal or just relational? This does not matter when it
comes to testing whether a data set fits the exponential equation. More
broadly, such issues are of legitimate interest to philosophers of science.
But one can largely do without worrying about them when the objective
of science is seen as establishing connections among factors, and then
connections among connections.
Ignorance, except for constraints
Some models or formats are based on near-total ignorance of details.
They are general rather than specific to some substantive field such as
sociology or political science, given that field-specific information does
away with near-total ignorance. These models start with normal and
lognormal distributions, and the mean of the extremes, all of which deal
with only one factor. They continue with models connecting two factors
the range of which may be subject to constraints. Finally, models based
on ignorance and constraints on rates of change lead to a puzzling
feature of the exponential function.
Normal distribution – full ignorance
This is the granddaddy of all ignorance-based models. A quantity like the size of peas has a median that is imposed by substantive reasons.
Actual sizes vary randomly around this median, subject to no
constraints. If they can range randomly from minus to plus infinity, a
definite mathematical expression results from this very ignorance, producing a bell-shaped curve. Is the relationship deterministic or probabilistic? For each single item, it is probabilistic, but for a sufficiently large
number of items, the form of the equation is deterministic: the bell-shaped curve will materialize, unless there are factors that counteract
randomness.
Utterly lacking any specific information, normal distributions occur
in all scientific fields, from physics to biology and social sciences. It
would be pointless to ask for some substantive, specifically social
reasons why some items “decide” to “arrange themselves” according to
a normal distribution. They don’t! There is no mechanism or process.
So, is the normal distribution a model or just a format? This distinction
does not seem to affect the way it is used.
This ignorance-based model transcends the bounds of social
sciences. Substantive implications enter in two ways. First, the mean
values of normal distributions (and their standard deviations) have
substantive origins– we can compare the means of different data sets.
Second, if a distribution fails to be normal, additional substantive
reasons must enter, and we should look for them. A two-humped
distribution in heights of people informs us that both females and males
may be present.
Lognormal distribution – a single constraint
Here a single constraint is introduced: The quantity x is restricted to
positive values. Its logarithm can range from minus to plus infinity,
without any constraints. Hence logx is expected to have a normal
distribution. The equation for lognormal distribution thus results from
the equation of the normal distribution. The resulting curve starts at 0
and goes steeply up; it extends a long tail to the right. The model is
common to all scientific fields. Substantive implications again enter
through comparing the lognormal parameters of different data sets
(which correspond to the means and standard deviations of logx) and
through deviations from the lognormal expectation.
The mean of the limits – two constraints
The mean of lower and upper limits (m and M, respectively) for a
quantity x is the logical best guess for the value of x, in the absence of
any other information. If x must be positive, the geometric mean is preferable: g = (mM)^(1/2). The mean-of-the-limits model is probabilistic.
All we expect is that it predicts the median result over many cases. In
this sense, it is analogous to the arithmetic mean in the normal
distribution and the geometric mean in the lognormal. When applying
g = (mM)^(1/2), we would imagine a vaguely lognormal-looking distribution of actual values around this median.
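As a quick numerical illustration (a sketch with invented limits, not an example from the book), the guess can be computed directly; the arithmetic mean is shown only for contrast.

import math

m, M = 1, 100         # hypothetical lower and upper limits on a positive quantity x
g = math.sqrt(m * M)  # ignorance-based best guess: g = (mM)^(1/2)
a = (m + M) / 2       # arithmetic mean of the limits, for comparison
print(g, a)           # 10.0 versus 50.5 -- for positive quantities the geometric mean is preferable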
Examples in Chapter 1 include the number of parties in an assembly elected by unrestricted nationwide PR (n = T^(1/2)), but also riot victims and the weight of an unknown mammal. The latter example reminds us that this ignorance model transcends the social realm – it is as universal as the normal distribution. As for the normal distribution, it would be pointless to ask for reasons that are specifically political (for parties) or biological (for mammals). This is precisely the outcome in the absence of any mechanism or process.
The share of the largest component, S1 = S/n^(1/2) (Chapter 4), offers another example. Using the relative share s1 = S1/S, it could be reformulated even more compactly as s1 = 1/n^(1/2). Later, Figures 21.4 and
21.5 show a situation where the arithmetic mean of the limits makes
more sense than the geometric.
Compared to the single constraint in the lognormal distribution – a
lower limit of 0 – the model based on limits introduces two constraints
on x, lower and upper. Both limits depend on the issue at hand. Thus, the lower limit is 1 in n = T^(1/2), but it is S/n, the mean size, in the case of
the largest component. So, while dealing with only one quantity x, like
normal or lognormal distributions, we are slipping toward a relationship
between 2 or even 3 factors. It’s the relationship between x and M when
the lower limit is logically set at 1, or among x, M and m when the
lower limit also can vary (as for the mammals or riots). All the
following models deal explicitly with relationships between 2 factors,
with extensions to 3 and more factors possible.
The dual source of the exponential format: constant rate of change,
and constraint on range
The exponential model is the prime example of a rate equation, a slope
equation, as it was first introduced. The rate of change dS/dt is proportional to existing size: dS/dt =kS. The relative rate of change is the
ratio of absolute change rate and the size: (dS/dt)/S. (The more familiar
percent change rate is just 100 times that.) Now the exponential model
becomes even more succinct: (dS/dt )/S=k – relative rate of change is
constant. This is the fundamental ignorance idea behind the exponential
model: If we do not know whether the relative (percent) growth rate
increases or decreases, the neutral guess is that it stays the same. This
could be called “the rule of conservation of relative rate of change”.
In contrast to this very simple and meaningful differential equation,
the corresponding integrated equation is much more complex even in its
simplest form, S = S0e^(kt). Figure 25.2 offers an example of exponential
decrease over time – the effective number of sovereign political entities
has decreased at a constant relative rate.
Remarkably, the integrated form S = S0e^(kt) also emerges from a quite different approach. Restrict a factor that changes in time to only positive values, 0<y<+∞, while time still ranges from minus to plus infinity: -∞<x<+∞. Then S = S0e^(kt) is the simplest equation that respects this simple
constraint.
Restricting one of the two factors to positive values, or imposing a
constant relative rate – these seem utterly different starting points. How
come that the constant rate-based model and the constraints-based
format lead to the same equation? This is worth pondering. (Mathematically, of course, there is no puzzle. But for model building based on
ignorance and constraints, the two starting points look very different.)
Simple logistic change: Zone in two quadrants
Within two allowed quadrants, we can further impose an upper limit on
y. This means 0<y<M instead of 0<y<+∞. The rate equation changes
from dS/dt =kS to dS/dt =kS(1-S/M). The integrated equation produces
a drawn-out S curve. Figure 19.4 compares the exponential and simple
logistic patterns. Note that adding the constraint y<M does not undo the
exponential format – it just grafts a new term, 1-S/M, on the earlier
expression. This is an example of the broader guideline: “The specific
cannot override the general but only can add to it”.
For microbial growth, the logistic pattern initially offers the same
mechanism as for the exponential: each microbe divides. Later growth
toward the ceiling is slowed down by whatever imposes this ceiling –
limits on food, space, and accumulation of waste materials. We may
picture it as if each microbe prolonged its dividing period to an equal
extent, but we may well suspect that the squeeze affects microbes at
different locations differently. The mechanisms offered to visualize
what produces the output of a model all too often involve such an “as
if” – and this is acceptable, as long as we do not forget that it is only an
“as if”.
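The contrast between the two formats can also be seen from their integrated forms. A minimal sketch (Python; the values of S0, k and M are invented for illustration) uses the standard closed-form solution of the simple logistic, S = M/[1 + ((M-S0)/S0)e^(-kt)]:

import math

S0, k, M = 1.0, 0.7, 100.0   # assumed initial size, rate constant, and ceiling

def exponential(t):
    # dS/dt = kS  ->  S = S0 * e^(kt)
    return S0 * math.exp(k * t)

def logistic(t):
    # dS/dt = kS(1 - S/M)  ->  S = M / (1 + ((M - S0)/S0) * e^(-kt))
    return M / (1 + (M - S0) / S0 * math.exp(-k * t))

for t in range(0, 13, 2):
    print(t, round(exponential(t), 1), round(logistic(t), 1))
# Early on the two curves nearly coincide; later the logistic bends toward the ceiling M.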
Limits on ranges of two interrelated factors: Fixed exponent format
When two interconnected factors, x and y, are restricted to positive
values, then only one quarter of the entire two-dimensional field is
allowed: 0<x<+∞ as well as 0<y<+∞. If these two factors are systematically connected, then the simplest form which does not produce
absurdities is the fixed exponent equation y = Ax^k.
Why should we expect such a relationship? It does not evoke any
simple process or mechanism. We are back to the broad principle of
ignorance. If we have logical reasons to assume that y=0 and x=0 go
together (a logical anchor point), then a constant value of k is the neutral
guess as compared to k increasing or decreasing in y = Ax^k.
An example is journal circulation vs. population for a given language. Per capita circulation is j = 2×10^(-6)·P (Figure 25.1). This corresponds to total circulation J = 2×10^(-6)·P^2. The specific value k=2 comes, of
course, from the number of communication channels, a model to which
we’ll come.
Zone in one quadrant: Anchor point plus floor or ceiling lead to
exponential change
A surprise awaits us when we impose a further constraint. Within one
quadrant, restrict y, by imposing an upper limit. We still have 0<x<+∞,
but now 0<y<M. We might expect a modification in y = Ax^k, just like introducing an upper limit in two quadrants modified the exponential into simple logistic. But no, here the exponential pops up again. Indeed, when the anchor point is at (0,M), with M≠0, the format y = Ax^k cannot fit, while the exponential S = Me^(kx) does.
When the anchor point is at (1,M), as is the case when x stands for the number of parties, y = Ax^k could also work. However, the exponential
form remains preferable because of the appealing simplicity of a
constant relative rate of change (see Chapter 20).
Most instances where we had occasion to use the exponential equation did involve
an anchor point and a floor in one quadrant, so that a competing model based on y = Ax^k
remains possible. They include the following cases.
In Figure 20.1, interest group pluralism I decays exponentially as the number of
parties increases, starting from its maximum possible value at N=1 and approaching but
not reaching a floor at I=0. The suggested fit I = 4e^(-0.31(N-1)) follows the format S = S0e^(kx)
once we set x=N-1. The corresponding relative rate equation is (dI/dN)/I=-0.31 –
meaning that I decreases with increasing N at a constant percent rate. In Figure 20.2 too,
electoral disproportionality D decreases with x=N-1 at a constant percent rate. This
format is harder to visualize here because D has been placed on the horizontal axis and
N on the vertical.
In contrast, the refined model for electoral volatility in Figure 11.2 involves an
exponential increase to the ceiling as the number of parties increases. When counting V
as a fraction of 1 (rather than percent) the broad form is V = 1 - e^(-k(N-1)). The equation for the slope can be calculated: dV/dN = k(1-V). This expression is not very illuminating. However, call the complement of volatility "steadiness", S = 1-V. Then dS/dN = -kS results: As
N increases, voter steadiness decreases proportional to steadiness itself. We can
generalize. Whenever a quantity y starts from 0 and approaches a ceiling M, it is easier
to visualize its complement M-y, which starts from M and approaches a floor at zero.
This complement changes proportional to its existing size.
In none of the cases above can we pin down a specifically political process or
mechanism for how the number of parties affects the various outputs. Indeed, we should
not look for such a mechanism any more than we do when encountering a normal
distribution, because the reason is more universal. The exponential format is what any
quantities tend to follow, on the average, when constrained to move from an anchor
point to a floor or ceiling. Under these conditions, specifically social factors make
themselves apparent when the exponential format is NOT followed. In all the cases
considered here, random scatter is so wide that the exponential pattern cannot be
rejected.
Box in one quadrant, plus two anchor points
For the zone in one quadrant, we have 0<x<+∞ and 0<y<M. Now
impose an upper limit on x too: 0<x<L as well as 0<y<M. The box thus
formed can be transformed into a square box, ranging from 0 to 1, by
simple transformations X=x/L and Y=y/M. So we have to deal only
with this square box.
In such a box, we usually expect that, for a given value of X, a value of Y exists – and only one. Most often we also expect the reverse – that for a given value of Y, a value of X exists. For all values of Y to occur
at some value of X, and vice versa, diagonally opposite corners of the
box must be anchor points: either 0,0 and 1,1, or 0,1 and 1,0. In the
latter case, a shift to Y’=1-Y takes us to the former, so we have to deal
only with the anchor points at 0,0 and 1,1.
The simplest equation for a curve joining these points is Y = X^k, which is a special case of the simplest model for the entire one quadrant allowed, y = Ax^k. However, the possibility of 1-Y = (1-X)^k must also be kept in mind. If this leads to a better fit, then the transformations X" = 1-X and Y" = 1-Y return us to Y" = X"^k, so we have to deal only with Y = X^k.
In all such cases the diagonal Y=X offers a useful comparison line. A format also exists that is symmetric in X and 1-X, as well as in Y and 1-Y, but it is more complex.
Examples of relationships following this model range from fictitious assembly-president approval ratings and the very real support for the US Democrats in 1996 and
2000 (Chapter 8) to support for capitalism and democracy (Chapter 9), and relations
between bicameralism and federalism (Figure 18.3). Exercises also suggest it for female
vs. male literacy (Ex. 8.2 and 9.2), gender equality vs. tolerance of homosexuality (Ex.
9.4), and measures of centralization (Ex. 10.1).
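One simple way to estimate the exponent k in Y = X^k from data inside the unit box is to use logY = k·logX and fit logY against logX through the origin. The sketch below (Python, with invented data points) shows one such estimate; it is not the only possible fitting choice.

import math

# invented (X, Y) pairs inside the unit box, roughly following Y = X^k
data = [(0.1, 0.02), (0.3, 0.12), (0.5, 0.33), (0.7, 0.55), (0.9, 0.86)]

num = sum(math.log(y) * math.log(x) for x, y in data)
den = sum(math.log(x) ** 2 for x, y in data)
k = num / den        # least-squares slope of logY on logX, forced through the origin
print(round(k, 2))   # Y = X^k automatically respects the anchor points (0,0) and (1,1)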
Box in one quadrant, plus three anchor points
Up to now, all the constraints have been general, outside any specific
scientific field. Hence the resulting models are universal, given the type
of constraints imposed. When introducing a third anchor point in the
box, we arguably become more substantive. In Figure 9.4, the anchor
point (0.5,0.5) was introduced specifically in the context of two-party
elections. The other example (Figure 21.3) deals with conservatism of
voters and representatives, again in a two-party context. The third
anchor point is based on parity of output, in the presence of equal inputs
by two groups. Here we may deal with an issue specific to political
science, given that one is hard put to offer examples outside party
politics. (The relationship between urban and rural radio set ownership
in Exercise 14.1 is rather hypothetical.)
Does this make the resulting model, Y/(1-Y) = [X/(1-X)]^k, a substantive one? I doubt it. Once the extra anchor point at (0.5,0.5) is posited,
on whatever grounds, broad considerations again take over: What is the
simplest way to wiggle through the three points, on the basis of universal concerns of smoothness and continuity?
The kinky variant of a model for the representative’s positions (Chapter 21) could
be said to involve more of a political process, in that the representative tries to balance
off various voters.
Introduction of bias (Figure 22.2 and Exercise 22.1) offers a way for substantive
factors to enter. There must be something that shifts the inflexion point off center. This
point stops being a logical anchor point. Once the general degree of bias is stipulated,
however, through parameter b, the model Y/(1-Y) = [X^b/(1-X^b)]^k again looks for the
simplest way to wiggle through. There is no substantive mechanism, beyond whatever
determines b. Does this matter? If a cat does catch mice, how concerned should we be
about its color? Recall that even exponential change can be connected to a mechanism
such as bacterial division only in selected cases. This is no reason to refrain from
making use of it.
Communication channels and their consequences
The number of communication channels among n actors finally seems
to offer us a substantively social model: c = n(n-1)/2 ≈ n^2/2. The term "actors" feels social, and the mechanism is clear: With n actors present,
the addition of another one adds n new channels – one toward each of
the existing actors. However, similar notions are used also in communication engineering and biology of contagious diseases – they are not
specifically social. Counting channels among actors, after all, boils
down to counting links among points.
When the number of actors is so large that quasi-continuity can be assumed, this
corresponds to another rate equation: dc/dn ≈ n. This leads to dc/dn ≈ 2^(1/2)·c^(1/2) – the number
of channels increases with increasing number of actors at a rate proportional to the
square root of the number of channels.
The most direct application is circulation of journals in a language
spoken by P people. If journals mean communication channels, the
number of journal copies might be proportional to population squared:
J = kP^2. In the case tested (Figure 25.1) it works out. But this is a very
remote approximation of what “really” ties journal copies to people. So
the actual mechanism of social interaction is at best replaced by a very
iffy “as if”. One may easily conclude that it was fortuitous that the data
fitted such a naively logical expectation – but then we are back to
square one. Or one may conclude that some essential feature was
nonetheless captured – and then we may have something to build on.
In C = 42 years/N^2 for cabinet duration (Chapter 6), k=2 also derives
from the number of communication channels among N actors. Shifting
from real actors to the effective number of parties is a leap of faith,
which paid off. The substantive mechanism of the connection is limited
to the probability that the number of parties affects duration inversely.
The cube root law of assembly sizes (Chapter 19) takes into account
the number of communication channels at three different levels and
minimizes their total. Among the models using communication
channels, this one might be the most substantive, provided one tolerates
the numerous simplifying assumptions involved.
Population and trade models
The world population model (Figure 25.3) combines a three-fold interaction of population, technology and Earth's carrying capacity. This is a set of
3 rate equations, which have not been presented here. This combination
has some substantive basis.
For the trade/GDP ratio, various approaches have been surveyed,
including the basic rate equation dI/dr=-kI from physics, for absorption
of any flow intensity I over distance r. The problem is eventually placed
within a square box (Figure 25.6). The diagonally opposed anchor
points make logical sense, but one is left wondering what imposes the S-shaped curve joining them, with its inflexion point. It isn't clear how well the simple bias format Y/(1-Y) = [X^b/(1-X^b)]^k fits the data. In sum,
substantive mechanisms enter, but more work is needed.
Substantive models: Look for connections among connections
Development of truly substantive social models, which introduce
specifically social processes or mechanisms, depends very much on
interlocking of fairly universal partial models. What does this mean?
Exercises 10.1 and 10.2 give a hint of how the constants k in several
equations of form Y = X^k are interconnected in the case of various
measurements of centralization. Discussion of support for US Democrats in two elections (Figure 8.5) points out possibilities for comparing
many elections over time and maybe in many countries. Why has more not been achieved in this direction? The database for values of k is
lacking. Such a database will come about when social scientists stop
putting meaningless regression lines across data clouds, which obviously beg for a fit with Y = X^k, and start fitting with a format which respects
logical constraints.
Empire growth (Figure 25.2) offers an example of connectedness
between rate constants k in exponential patterns N = N0e^(kt). The effective number of polities on the basis of area (N_A) is larger than that based on population (N_P), because empires tend to form in more densely populated regions. Going by the geometric mean of extreme possibilities, the model N_A = N_P^2 emerges. This also means that k_A = 2k_P in N = N0e^(kt), which shows up as a doubly steep slope in the logN graph in
Figure 25.2.
We need more interconnections on such a level – models to interconnect constants in models. This involves models of a similar type,
such as all boxes or all exponentials, connecting variables of a similar
type.
We also need to interconnect models of different types, which use
different variables. Chapter 10 offers the sequence T → n → s1 → N → C. Each arrow stands for a separate model, joining two or three variables analogous to n = T^(1/2), s1 = 1/p^(1/2), and C = 42 years/N^2 presented in this
book. The actual string of models starts with two separate basic inputs:
assembly size and electoral district magnitude. Many more than two
quantities are connected here, through a string of equations. The nature
of the underlying models varies: while s1 = 1/p^(1/2) is rooted in the square root of the extremes, C = 42 years/N^2 results from the number of
communication channels.
Such strings and networks of interconnected models form the backbone of physics, and social sciences need more of them. The existence
of the string shown here proves that they can exist in social sciences.
This is where the quest for specifically social mechanisms begins:
connections among factors,
and then,
connections among connections.
APPENDIX A
What to Look for and Report in Multivariable Linear Regression

• Use regression only for exploratory research or for testing logical models. Don't even think of using regression for model construction itself.
• Graph possibly meaningful relationships, so as to avoid running linear regression on curved data.
• Guard against colinearity or "co-curvilinearity".
• Use Occam's Razor: Cut down on the number of variables.
• Distinguish between statistical significance (the "stars") and substantive meaningfulness.
• Report not only the regression coefficients and the intercept but also the domains, medians and means for all input variables.
In Chapters 15 and 16, we considered linear regression of one variable
against another – y vs. x. We can also regress an output variable against
several inputs: z ← x, y – but this is a one-directional process, with all its pitfalls. Symmetric regression of 3 variables x, y and z is still being
worked on. In social science literature we often encounter tables that
report the results of multi-variable linear regression. Those numerous
numbers seem to involve useful information – why else would they be
published? But what can we actually read out of them?
We’ll first approach the issue from the viewpoint of someone who
wants to make use of published multi-variable regression tables so as to
gain social insights. Thereafter, we’ll ask what we should do when
running such a regression ourselves. When reporting the results, we
should give enough information so that readers can understand – and
possibly carry out some further analysis. Only this way can bits and
pieces of knowledge become cumulative science.
The principle of multi-variable regression is the same as for single
variable regression (y vs. x), but more than one input variable is fed in.
The output is usually construed as y = a + b1x1 + b2x2 + …, but it really should be shown as y ← a + b1x1 + b2x2 + …, because the relationship is valid in
only one direction (recall Chapter 15). Symmetric multivariable
regression is difficult, and mathematicians still work on it. The equations
I tentatively offered (Taagepera 2008: 174-175) are plain wrong. So we
are reduced to directional regression, with all its risks. There is still quite
a lot we can do, provided that major pitfalls are avoided.
The main purpose of determining a regression line is to estimate the
output for given inputs. Given the values of inputs x1, x2,…, we should
be able to deduce the corresponding most likely value of y from y ← a + b1x1 + b2x2 + …. But this is not all we are interested in. Maybe scatter
of y, for given x1, is so wide that the impact of x1 is “statistically
insignificant” and we should omit x1. On the other hand, maybe x2 is
highly significant in the statistical sense, but it varies over such a short
range that it hardly affects y, given the wide range of y. Could we then
overlook x2? We may also be interested in what the median value of y
is, and how large or small y can get. Maybe we have the values of x2,
but the values of x1 are hard to get. Can we still estimate y, and how
well? The importance of such questions will become clearer as we walk
through an actual example.
Making use of published multi-variable regression tables: A simple
example
Table A.1 shows a part of an example presented in Making Social
Sciences More Scientific (Taagepera 2008: 207), based on a regression
in Lijphart (1994). The output NS is the effective number of assembly
parties. It may be logically expected to depend on two factors: How
high is the “effective threshold” (T) of votes at which a party is likely to
win a seat; and the size of the representative assembly (S), meaning the
number of seats available. As the latter can vary over a wide range of
positive values (from 60 to 650), it is likely to be distributed lognormally rather than normally, and hence logS is used in regression,
rather than S itself. What do these numbers mean, in Table A.1? What
can we conclude or deduce from them?
Table A.1. Effective number of assembly parties (NS) regressed on effective threshold (T) and logged assembly size (logS).

Independent variables       Domain (Range)   Mean   Median   Coefficients for NS
Effective threshold (T)     0.1 to 35        11.6   7.0      -0.05**
Log assembly size (logS)    1.8 to 2.8       2.2    2.2       0.12
Intercept                                                     3.66
R2                                                            0.30
Adjusted R2                                                   0.28
*  : statistically significant at the 5 per cent level.
** : statistically significant at the 1 per cent level.
These coefficients mean that NS can be estimated from
NS 3.66-0.05T+0.12logS.
The two R2 values are measures of scatter around this directional
equation. This equation accounts for 28% of the variation in NS. The
remaining 72% remain random scatter in the sense that threshold and
assembly size cannot account for it, at least the way they are entered in
the regression. (“Accounting for” in a statistical sense must be
distinguished from “explaining” – only a logical model can explain the
process by which T and S affect NS.) The two stars at T indicate that this
factor has a definite impact on the output. The lack of stars at logS
indicates that scatter is so large that it is statistically uncertain whether
logS has any systematic impact on the output.
If so, then we might as well omit logS altogether and estimate from T
alone. Scatter might be hardly reduced. It might be tempting to use the
equation above, just dropping the logS term: NS ← 3.66 - 0.05T. Wrong.
This would imply that we assume that logS=0, hence S=1. This would
be ludicrous: no national assemblies are that small. We need to replace
the omitted factor not by 0 but by its mean value. Unfortunately,
mean values are all too often omitted from reported regression results.
Hence, to calculate the output for a given threshold value, we have also
to locate the assembly size, even while it is stated that assembly size
lacks statistical significance!
The table above does add the mean, so that we can drop logS, if we
so wish. This mean is 2.2, corresponding to assembly size S=160. The
best estimate of the output becomes NS ← 3.66 - 0.05T + 0.12(2.2). Hence
NS ← 3.92 - 0.05T.
The difference between 3.92 and 3.66 may not appear large, but given
that few countries have fewer than 2.0 or more than 5.0 parties, a
difference of 0.3 is 10% of the entire range.
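The same substitution is easy to carry out mechanically. A small sketch (Python), using the coefficients and the mean of logS from Table A.1:

import math

def estimate_NS(T, logS=2.2):
    # directional estimate NS <- 3.66 - 0.05*T + 0.12*logS (Table A.1);
    # logS defaults to its mean, 2.2 (assembly size of about 160 seats),
    # for the case where assembly size is dropped from the estimate
    return 3.66 - 0.05 * T + 0.12 * logS

print(round(3.66 + 0.12 * 2.2, 2))                          # 3.92: intercept of the reduced equation
print(round(estimate_NS(T=7.0), 2))                         # estimate at the median threshold
print(round(estimate_NS(T=7.0, logS=math.log10(650)), 2))   # a large assembly of 650 seats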
The table further adds the median and the domain, meaning the
range from the smallest to the largest value. Why are they needed?
Regression makes sense only when the variables are fairly normally
distributed, so that their medians and arithmetic means coincide. This is
the case for logS. (For S itself the mean would exceed the median
appreciably.) For T, the mean exceeds the median by almost 5 units. Is
the gap excessive? It depends on how widely a variable is observed to
range – its domain. The domain of T goes from near-zero to 35, so a 5-unit discrepancy is appreciable. We might obtain a better fit for the
number of parties, if we carried out the linear regression on the square
root of T rather than T itself.
Actually, the distribution of T here is not just a peak with a longer tail in one
direction – it has two separate peaks. Hence the very use of regression becomes
problematic. Once we omit logS, leaving only one input variable, it would be high time
to graph NS against T, see what the pattern looks like, and try to express it as an
equation.
There is another reason for reporting the domain. Researchers sometimes neglect specifying the measures they use. Did “logS” mean
decimal or natural logarithms? When the domain is given, it becomes
clear that decimal logarithms are used, because the corresponding range
of S would be 60 to 650 seats, which is reasonable. If logS stood for
natural logarithms, S would range from about 6 to 16 seats!
Often there are several ways to measure what looks the same conceptually. For
instance, cabinet duration is observed to range from a few months to 40 years by a fairly
lenient measure, but only up to about 5 years by a more stringent one (which resets the
clock whenever there are elections). How am I to know? Well, if the domain for cabinet
duration is given as 0 to 5, then I have a pretty good idea of which indicator has been
used. Also, one often talks of “corruption index” when actually using an index of lack of
corruption (so that honest countries have highest scores). Authors often are so used to a
given measure that they neglect to specify it – or they even mis-specify, as for
corruption.
Exercise A.1 shows what further information one can glean from
regression results, provided that the mean, median and domain are
included.
Exercise A.1
Use Table A.1 for the following.
a) Calculate NS for the median values of T and logS. One can
expect the median value of NS to be close to this result.
b) What is the lowest value of NS that could result from the
extreme values of T and S on the basis of this regression line?
c) What is the highest value of NS that could result from the
extreme values of T and S on the basis of this regression line?
d) By how much do these extreme values of NS differ from its
presumed median? Comment on what it implies.
e) Compare the extents to which T and S, respectively, are able to
alter NS. Could you have predicted it just by looking at their
respective coefficients?
f) Given that the impact of logS is not statistically significant, we
should be able to ignore it and still get basically the same result.
How would the expected range of NS change?
g) Which of the previous questions could you answer, if only the
columns for “Coefficients” were reported?
Guarding against colinearity
Suppose we have an output z that might depend on some factor x and
possibly also on another factor y. (I owe to Kalev Kasemets this
example based on actual published work.) We run simple OLS z on x
and find
z = 0.634(±0.066)x + 0.789(±0.197)
R2=0.828
p(x)<0.001.
The "(±0.066)" indicates the possible range of error on the coefficient of
x. The low value of p(x) says roughly that chances are extremely low
that the correlation is random chance. And R2=0.828 indicates that 83%
of the variation in z is accounted for by variation in x. This looks pretty
good.
Now run OLS z on y, and the outcome is
z = 0.325(±0.027)y + 0.602(±0.169) R2=0.887
p(y)<0.001.
This looks like even a slightly better fit: R2=0.887 indicates that 89% of
the variation in z is accounted for by variation in y.
We get greedy and feed both of them into multi-variable linear
regression. With two inputs, we should obtain an even better fit. We get
z = -0.208(±0.268)x + 0.426(±0.133)y + 0.0577(±0.174)
R2=0.890 p(x)=0.447, p(y)=0.005. This R2=0.890 is practically no
higher than 0.887 for y alone. Most surprising, the impact of x now
looks negative! It reduces z rather than increasing it! The p(y) is higher
than before, meaning a higher probability that the correlation between y
and z is random chance. The p(x) is disastrously higher than before,
showing an almost 50-50 probability that the correlation between x and
z is random chance.
So which way is it? Does x enhance z in a highly clear-cut and
significant way, or does it reduce z, in a quite uncertain way? We cannot
have both.
The answer is that the two inputs must be highly correlated themselves. Indeed, the two inputs together cannot account for 83+88=171%
of the variation in the output! The inputs must be correlated by at least
71%, roughly speaking. What happens if we ignore this “colinearity” of
x and y, and plug them both into a multi-variable regression? They
destroy and confuse each other’s impact. Among the two, y has a
slightly higher correlation with z. It sort of cannibalizes the effect of x,
reducing it to nothing. The small negative and uncertain coefficient of x
does NOT show the total effect of x on z – it shows the residual effect of
x, once its effect through y has been accounted for.
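The cannibalizing effect is easy to reproduce with simulated data. The sketch below (Python with numpy; the numbers are invented and are not those of the published example) builds a y that is essentially a scaled x plus noise and a z driven by y. Each single-input fit looks fine, while the two-input fit gives x a small, unstable coefficient.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.3, size=n)   # y is essentially a scaled x (colinearity)
z = 0.4 * y + rng.normal(scale=0.5, size=n)   # z is actually driven by y

def ols(inputs, output):
    # ordinary least squares with an intercept column; returns the coefficients
    A = np.column_stack(inputs + [np.ones(len(output))])
    coeffs, *_ = np.linalg.lstsq(A, output, rcond=None)
    return coeffs

print(ols([x], z))     # z on x alone: clear positive slope
print(ols([y], z))     # z on y alone: clear positive slope
print(ols([x, y], z))  # z on both: x's coefficient shrinks toward zero or flips sign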
We effectively have a causal chain x→y→z or maybe rather (y≈kx)→z. By plugging both x and y into the same regression we arbitrarily assumed that causality is x→z←y – meaning x and y affecting z separately, which is here a blatantly false assumption. The computer program does not argue with you. If you feed in junk, it obediently processes it, but "junk in, junk out".
What are we supposed to do? In the present case, where both inputs
account for more than 80% of variation in output, it’s fairly simple: use
only one. Which one? One might pick the one with the higher R2, in the
present case y, which tacitly implies that x acts through y: xyz. But
this is not the only consideration. If we have logical reasons to assume
that the causal process is yxz, then a small shortfall in R2should not
deter us from using x. Also graph z vs. x, z vs. y and z vs. x. Curvatures
or irregularities in the data clouds may give you hints on how the
variables are related.
The real hassle comes when the inputs are only mildly correlated – not R2=0.70 but R2=0.30. Here one of the inputs may act on the output both indirectly and also directly: x→y→z←x. There are statistical ways
to handle such situations, but also graph the data. The main thing is: If
it looks odd, it probably is – then it’s time to double check or ask for
help. Do not report absurdities, without expressing doubts.
Running a multi-variable linear regression
Suppose that many factors come to mind, which could conceivably have
an impact on the values of some output variable y. Among these, A, B,
C, D, E and F are the prime suspects. Furthermore, the output might be
different for women and men. As a first exploratory step, we might run
multi-variable OLS regression. It may suggest which factors have a
definite impact and thus guide our search for a logical model. Once a
model is constructed, its testing may need another regression. The
process includes the following stages.
Processing data prior to exploratory regression,
Running exploratory regression,
Reporting its results,
Re-running exploratory regression with fewer variables,
Graphing the predictions of the regression equation against the actual outputs, and
Model-testing regression, once a logical model is devised.
Processing data prior to exploratory regression
It matters what we feed into the computer. Not all relationships are linear. Before applying linear analysis we better do some thinking.
Instead of feeding in x, maybe linearity is more likely with 1/x – or even
1/x2. (The latter is the case between cabinet duration and the number of
parties, for instance.) When there is linearity between y and 1/x2, then
there is no linearity between y and x. If we still regress on x, we would
fit a curved data cloud with a straight line. We would obtain some
correlation but not to the full extent possible. Most important, we would
miss the logical nature of the connection.
But what can we do, short of working out elaborate logical models?
At the very least, consider the conceptually allowed ranges. If factors A
and D can in principle range from minus to plus infinity, enter them as
they are. But if factors B and C can take only positive values, it is safer to enter logB and logC.
Also consider whether all factors are mutually independent. We
talked about guarding against colinearity. This is worth repeating.
Suppose input variables D and E are strongly connected through D = a - bE. Then one of them should be left out. Which one? Before answering this question, also consider "co-curvilinearity": Maybe D and E are even more strongly connected through D = a/E^2. How do we know?
Graph each potential input variable against each other. If the data cloud
looks like a fat ellipse, almost circular, then the two variables are
independent and can be used together. But if the data cloud looks like a
thin tilted ellipse or bent sausage, then the variables are interconnected
and we better keep only one of the two. But which one should it be?
If y may be affected by D and E, which are interrelated, then it might
be that one of them acts through the other:
E → D → y   OR   D → E → y.
Which one could it be? Make graphs and run correlations y vs. E and
also y vs. D. The one with the higher R2 (once data clouds are
straightened out) is likely to be more directly connected, and this is
most often the one to keep. True, it could be that one of the factors acts
on y both indirectly and also directly: E → D → y ← E, but one of them
is likely to predominate. Let us keep it simple, if we can.
Running exploratory regression
Suppose co-curvilinearity eliminates factors E and F. We are left with
A, logB, logC and D. We run multivariable regression on them. Suppose
we report the results, using today’s standard format (Table A.2). Instead
of just R2, there may be a somewhat differently labeled coefficient. Stars
indicate the strength of supposed significance in a statistical sense,
which often is misinterpreted (Taagepera 2008: 77-78). The computer
printout may have other statistical features that will not be discussed
here.
Table A.2. Typical minimal reporting of multi-variable linear regression analysis.

Factor A         -0.03***
Factor B (log)    0.12**
Factor C (log)    0.28
Factor D          3.77*
Dummy (F=1)       0.74*
Intercept         4.07
R2                0.39
What this table means is that the values of y can be best predicted from
values of A etc. by applying the equation of format y ← a + b1x1 + b2x2 + … and plugging in the coefficient values shown in the table. For males, it is
y ← 4.07 - 0.03A + 0.12logB + 0.28logC + 3.77D.
For females, add 0.74. Together, these variables account for 39 % of the
variation in y, as R2 tells us. The number of stars suggests that A affects
y most certainly, followed by logB, while the impact of D is less certain,
and the impact of logC could well be random chance. Whether gender
may have some impact also remains in doubt.
Lumping less significant variables: The need to report all medians and
means
Occam’s Razor is a principle that tells us to try to prune off everything
that is not essential. (Recall Albert Einstein’s advice: make your models
as simple as possible – and no simpler.) Which factors should we
discard? On the face of it, we should keep only the two most significant.
Then the equation above might seem to be reduced to y ← 4.07 - 0.03A + 0.12logB – but not so fast! By so doing, we would assume that
the mean values of logC and of D are 0, which might be widely off the
mark. As pointed out earlier, we must plug in the average values of
logC and of D. Also, assuming roughly equal numbers of females and
males, we should add 0.74/2=0.37. Thus the reduced equation would
result from
y ← 4.07 - 0.03A + 0.12logB + 0.28(aver. logC) + 3.77(aver. D) + 0.37.
But what are “averages” – medians or arithmetic means? And have we
reported their values? Without these values, anyone who wants to make
use of our findings to predict y would have to enter into the equation not
only the values A and B but also C and D. He would have to dig up the
values of C and D, even while our analysis concludes that they are
rather insignificant! Indeed, for this reason all too many published
regression results are useless for prediction: Too many variables
are shown, and their average values are not. So we better report the
averages.
As pointed out earlier, we better report both median and arithmetic
mean, to give the user a choice, but mainly for the following reason. If
median and arithmetic mean differ appreciably, this would indicate that
the distribution of values could not be normal. This means that
assumption of linear relationship is on shaky grounds – and we should
warn the readers. Actually, we should try to transform our data, prior to
linear regression, so that median and arithmetic mean would be roughly
equal.
Table A.3. Multi-variable linear regression results, when also reporting averages. RC = Regression Coefficient. Factor median weights (Median×RC) emerge.

                 Median   Mean   RC         Median×RC   RC after lumping
Factor A         1.20     1.23   -0.03***   -0.036      -0.03***
Factor B (log)   0.50     0.57    0.12**     0.06        0.12**
Factor C (log)   6.5      6.4     0.28       1.82        --
Factor D         0.40     0.37    3.77*      1.51        --
Dummy (F=1)      0.5      0.5     0.74*      0.37        --
Intercept                         4.07                   7.77 (=4.07+1.82+1.51+0.37)
R2                                0.39
Suppose we have done the needed transformations and can happily
report the results in Table A.3, where medians and means roughly
agree. I have also included the product of each factor’s median and its
regression coefficient. This is the median weight it contributes to y. If
we drop factors C, D and gender, then we must add these weights to the
intercept. The new intercept value is 4.07+1.82+1.51+0.37=7.77. So the
equation with some variables lumped is
y ← 7.77 - 0.03A + 0.12logB,
as reflected in the last column of Table A.3.
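The bookkeeping behind the lumped equation is just a sum of median weights folded into the intercept. A minimal sketch (Python), using the Table A.3 numbers:

medians = {"A": 1.20, "logB": 0.50, "logC": 6.5, "D": 0.40, "female": 0.5}
coeffs  = {"A": -0.03, "logB": 0.12, "logC": 0.28, "D": 3.77, "female": 0.74}
intercept = 4.07

kept = {"A", "logB"}                 # factors retained after applying Occam's Razor
lumped = intercept + sum(medians[f] * coeffs[f] for f in coeffs if f not in kept)
print(round(lumped, 2))              # 7.77 = 4.07 + 1.82 + 1.51 + 0.37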
Re-running exploratory regression with fewer variables
If at all possible, we should now carry out a new regression, using only
A and B. It should yield an intercept close to 7.77 and coefficients close
to the previous ones. If this is not so, there is something in the data
constellation that we should check more closely. The value of R2 can be
expected to drop below 0.39 because we no longer account for the
variation due to C, D or gender. If the drop is appreciable, we may have
to reconsider.
Indeed, the drop in R2 may be serious. Look at the factor weights in the table above: They are large for C and D (1.82 and 1.51), while tiny for A and B (-0.036 and 0.06). How much impact could such tiny inputs
have on the output? The question is justified, but to answer it, we must
also take into account how widely the variables are observed to range.
We saw that this is called their domain.
Report the domains of all variables!
Table A.4 adds the observed domains of variables. How widely could
the estimates of y vary? We must consider the extremes of inputs,
keeping track of the signs of their extreme values and coefficients. The
lowest value of y results from ymin = 1.5(-0.030) - 2(0.12) + 4(0.28) - 0.4(3.77) + 0 = -0.67. The highest value of y results from ymax = 0.9(-0.030) + 3(0.12) + 9(0.28) + 1.2(3.77) + 0.74 = 8.12. The actual domain of y
is likely to be somewhat smaller, because extreme values of inputs
rarely coincide, but clearly y can vary over several units.
Table A.4. Multi-variable linear regression analysis, also reporting the domains.

                 Median   Mean   Domain         Regr.coeff.   Span   Span×RC
Factor A         1.20     1.23   0.9 to 1.5     -0.030***     0.6    -0.018
Factor B (log)   0.50     0.57   -2 to +3        0.12**       5       0.60
Factor C (log)   6.5      6.4    4 to 9          0.28         5       1.40
Factor D         0.40     0.37   -0.4 to +1.2    3.77*        1.6     6.03
Dummy (F=1)      0.5      0.5    0 to 1          0.74*        1       0.74
Intercept                                        4.07
R2                                               0.39
The “span” is the extent of the domain, the difference between the
largest and smallest values. Then (Span×Regression Coefficient) is the
extent by which the given factor could alter y. While it is highly likely
that A has an impact on y (3 stars!), this impact is tiny. Even the extreme
values of A would alter y by only 0.018. If we dropped A, our ability to
predict y would hardly be affected. In contrast, variation in Factor D
alone determines most of the domain of values y could take, if the
impact of D is real. Trouble is, it is uncertain whether D has any definite
impact.
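The span-times-coefficient check of Table A.4 can be computed mechanically. A short sketch (Python, using the Table A.4 numbers):

domains = {"A": (0.9, 1.5), "logB": (-2, 3), "logC": (4, 9), "D": (-0.4, 1.2), "female": (0, 1)}
coeffs  = {"A": -0.030, "logB": 0.12, "logC": 0.28, "D": 3.77, "female": 0.74}

for factor, (low, high) in domains.items():
    span = high - low
    print(factor, span, round(span * coeffs[factor], 3))  # how much this factor can move y
# Factor A (three stars) can move y by only about 0.02; Factor D (one star) by about 6.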
We should distinguish between statistical significance (the “stars”)
and substantive meaningfulness. In terms of health issues, suppose
presently 54,000 people per year die of a given disease. Suppose some
rather expensive and painful medication reliably reduces mortality – to
53,000. Is a drop of 1,000 worth subjecting 54,000 people to this treatment? Now suppose that the uncertain Factor D can easily be altered –
like making drinking water slightly more or slightly less acid. If it
works, it would have a large impact on mortality, but we are not quite
certain it would even have an effect (single star).
Which factor should we focus on in the present example? I have no
simple answer. It depends on many other considerations. But one thing
is certain: Reporting only the information in Table A.2 could badly
mislead readers into thinking that Factor A is the best predictor, just
because it has the most stars. Whenever one runs a regression, one has
the possibility to determine the domains, medians and means of all the
input variables. Do report this information. Omitting the explanatory
columns (Span and Span×RC) in Table A.4, our table of regression
results should look like the one in Table A.5. Then the reader can play
around with it, weighing the impact vs. significance of various factors.
When she has access to the values of only some of the factors, she can
still make predictions, using the mean or median values for missing
factors.
Table A.5. Recommended format for reporting multi-variable linear regression analysis.

                 Median   Mean   Domain         Regr.coeff.
Factor A         1.20     1.23   0.9 to 1.5     -0.030***
Factor B (log)   0.50     0.57   -2 to +3        0.12**
Factor C (log)   6.5      6.4    4 to 9          0.28
Factor D         0.40     0.37   -0.4 to +1.2    3.77*
Dummy (F=1)      0.5      0.5    0 to 1          0.74*
Intercept                                        4.07
R2                                               0.39
Graphing the predictions of the regression equation against the
actual outputs
The advice “Always graph the data” may become unfeasible when more
than 2 factors interact. Whenever we graph two of them against each
other, the interference by other factors can completely blur out a very
real relationship. But one can always graph the predictions of the
regression equation against the actual outputs. We would expect the
resulting "data" cloud to be evenly scattered around the equality line: y_expected = y_actual, if the regression is any good. This "data" cloud should
look like an ellipse. If this is not the case, we should look into what
causes the irregularities.
When we regress y_expected against y_actual (preferably using symmetric regression), we obtain y_expected = a + b·y_actual. Here a should be very close to
0, and b should be very close to 1. If this is not the case, we should look
for ways to improve the original multivariable regression. Maybe some
input factor should be transformed before feeding it in. Maybe some
parts of the data produce “islands” far away from the equality line and
should be treated separately.
Always graph the predictions of a model or data fit against the
actual data – this is so easy to do that it should become standard
procedure in multivariable regression in particular. This would supply a
clear and simple picture of the quality of the regression.
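A minimal plotting sketch (Python with matplotlib; the data and variable names are placeholders, not results from the book) for the recommended prediction-vs-actual graph with its equality line:

import numpy as np
import matplotlib.pyplot as plt

# placeholders: y_actual stands for observed outputs, y_expected for regression predictions
rng = np.random.default_rng(1)
y_actual = rng.normal(5, 2, size=100)
y_expected = y_actual + rng.normal(0, 0.8, size=100)

plt.scatter(y_actual, y_expected, s=10)
lo, hi = y_actual.min(), y_actual.max()
plt.plot([lo, hi], [lo, hi])          # equality line y_expected = y_actual
plt.xlabel("y actual")
plt.ylabel("y expected (from regression)")
plt.show()
# A healthy regression gives a roughly elliptical cloud scattered evenly around the equality line.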
Model-testing regression
All the preceding refers to exploratory regression – trying to get some
idea of what might affect a given output. It may help to focus on just a
couple of inputs. The next step would be to ponder how these factors
might impact the output. This means trying to build a logical model.
More graphing may be involved – and don’t even think of using
regression for model construction itself! It is unlikely that the predictive
model would include more than 1 to 3 input variables. All others are
likely to act through these or be negligible, at least in a first approximation.
By this time, we no longer accept just any regression coefficient
values. Suppose the logical model is y = KAB/C^2. We know we have to
regress logy against a=logA, b=logB and c=logC. But we also know that
the result must be logy=k+1.00a+1.00b-2.00c – only the intercept
k=logK is not predicted. What do we do when the coefficients found
differ appreciably from 1, 1, and -2? The answer depends on the
specific situation. For the moment, let us just keep in mind the difference between preliminary regression (preliminary to model building
effort) and final regression (testing the model).
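For a model of the form y = KAB/C^2, the testing step amounts to regressing logy on logA, logB and logC and checking that the coefficients land near +1, +1 and -2. A sketch (Python with numpy; the data are simulated under the assumed model, merely to show the mechanics):

import numpy as np

rng = np.random.default_rng(2)
n = 100
A, B, C = rng.lognormal(size=(3, n))                      # positive inputs
K = 3.0
y = K * A * B / C**2 * rng.lognormal(sigma=0.1, size=n)   # the model plus mild noise

X = np.column_stack([np.log10(A), np.log10(B), np.log10(C), np.ones(n)])
coeffs, *_ = np.linalg.lstsq(X, np.log10(y), rcond=None)
print(np.round(coeffs, 2))   # expect roughly [1.00, 1.00, -2.00, log10(K)]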
Substantive vs. statistical significance
High statistical significance alone could be pointless for making sense
of the world. We must look for substantive significance. This point was
briefly made in the previous chapter as well as in Taagepera (2008: 77–78 etc.). Table A.6 is an attempt to visualize a similar message in Professor McCloskey's The Cult of Statistical Significance (<http://www.amazon.com/Cult-Statistical-Significance-EconomicsCognition/dp/0472050079/ref=sr_1_1?ie=UTF8&s=books&qid=1265515998&sr=8-1>).
Table A.6. Substantive vs. statistical significance. Rows: statistical significance. Columns: extent (size) of effect, e.g., by how much a cure reduces the death rate for a disease.

Low statistical significance (p<.1), tiny effect (e.g., from 39% to 38%): No substantive significance. Forget it.

Low statistical significance (p<.1), appreciable effect (e.g., from 39% to 18%): Median substantive significance. If it were real, it would have high substantive significance. So try to refine and add data, so as to raise statistical significance – the potential payoff is high! Also try to build a logical model, as it would help you to refine the data.

High statistical significance (p<.01), tiny effect: Low substantive significance. A reduction by 1 percent point is peanuts, regardless of how certain it is. BUT don't yet give up. Try to elucidate a plausible underlying mechanism, i.e., build a logical model. Then you might find ways to enhance the extent of the effect.

High statistical significance (p<.01), appreciable effect: High substantive significance. Congratulations! BUT beyond happy application, also try to elucidate the underlying mechanism, i.e., build a logical model. Otherwise you would be at a loss when the effect suddenly weakens.*
* At a refractive materials plant, they had a product vastly superior to
those of all competitors. The management was just happy about it and
never bothered to find out why. Then a foreman died, and the quality
plunged, never to rise again, no matter what they tried. By this time it
was too late to ask “why?"
APPENDIX B
Some Exam Questions
These are some questions used in recent midterm tests and final exams,
with open books and notes. The “points” reflect the expected time in
minutes to complete them, except for questions that offer data and then
only say: “Do all you can with this information, using skills developed
in this course. Beyond responding to step-by-step commands, show that
you can set up such steps for yourself.” If the student has grasped the
“meta-skills” this book tries to transmit, the response might take little
time. Lacking those skills, it would also take little time before the
student runs out of ideas of what to do with the data.
Why such an open-ended format, with open books and notes? When
you land a college-level job, your boss will not tell you “Do a), then b),
and then c).” She will tell you: “We have this problem. Find an
answer.” It’s an open-books-and-notes situation. If you return with
empty words, she will not give you half-credit just because she cannot
figure out if there is something relevant hidden in your gibberish.
Instead, she will wonder: “Should I keep this employee?”
These questions are presented here in an increasing order of
estimated length or difficulty. Of course, exercises spread throughout
the book also can serve as exam questions.
1. (4 points) How would you explain the meaning of decimal “logarithm” of a number like 100 or 10,000 to your 8-year-old niece?
2. (4 points) A population grows in time as P=5e^(0.23t). The constant e
stands for e=2.71828182… How much is P when t=0?
3. (10 points) Both graphs on the next page (A and B), from a recent issue of
the journal Electoral Studies (vol. 32, pp. 576-588, 2013), show the
citizens’ perception (C) of fairness of elections vs. international expert
perception (E) of fairness of the same elections. All scales are in
percent. The graphs differ in time and method of survey. On p. 584, the
author comments: “By contrast [to certain other countries], both the
Ukrainian and Peruvian elections are regarded far worse by international experts than by citizens living in these societies.”
a) Is this so in Graph A? Why, or why not?
b) Is this so in Graph B? Why, or why not?
Fluffy words (a.k.a. BS-ing) will not take you anywhere. Do not waste
your time on that. Start by applying some of the basic approaches taught
in this class, and the answer will be quite short. Add to the graph, to
help show your reasoning.
4. (10 points) Figure 8.5 in Taagepera, Logical Models…, based on a
graph by Johnston, Hagen and Jamieson (2004), shows a curve y vs. x,
where
y = support for Democrats in individual US states in Presidential
elections 2000, and
x = support for Democrats in individual US states in Presidential
elections 1996.
The relationship is approximately y=x^1.18, as the support for Democrats
dropped in 2000. This support was up again in 2008. Assume that
the pattern was z=y^0.91, where
z = support for Democrats in individual US states in Presidential
elections 2008.
a) Calculate the relationship between Democrat support in 1996
and 2008. In other words, express z in terms of x.
b) Was the support for Democrats in 2008 higher or lower than
in 1996 (assuming that z=y^0.91 holds)? How can you tell it from
the equation for z and x? (A graph might help.)
5. (12 points) Estimate the number of swimming pools (public and
private) in Los Angeles. (Show your reasoning. No credit for just a
number drawn from the hat.)
6. (12 points) In a developing country, 40 % of boys attend elementary
school, but only 10% of girls do. Developing countries tend to have
warm climates.
a) Calculate the likely percentage of girls attending school at a later
time, when 60% of boys do. (Zero points for guesswork. Place the
information given in a logical framework and calculate on that basis.)
b) At the time 40 % of girls attend elementary school, calculate the
likely percentage for boys.
c) Graph all three points, so as to make sure they lie on the same
smooth curve. Do they?
7. (15 points) Larger countries tend to have larger populations.
Examples:
Country      Area (thou. sq.mi.)   Population 1995 (million)
Trinidad            2.0                     1.3
Denmark            17                       5.2
Colombia          440                      37
USA              3700                     263
a) Graph on suitable scales.
b) Draw in the best-fit line or curve. If you can determine its equation,
good for you, but I really do not expect it at this early stage.
c) Which slope might one expect, in the absence of any information?
How does the actual slope differ? What might be the reason?
8. (16 points) The graph below shows the actual seat shares for parties
plotted against the seat shares expected on the basis of a logical model
(the nature of which does not concern us here).
a) For “Two-party contests”, this model looks like a satisfactory
approximation, because [complete the short sentence] …
b) For “Two-party contests”, this model is far from a perfect data fit,
because [complete the short sentence] …
c) For “Multiparty contests”, this model looks unsatisfactory,
because [complete the short sentence] …
d) Add a smooth curve to fit the “Multiparty contests” data. Do it
quite carefully, so as not to propose absurdities, inadvertently.
9. (15 points) How much could a country's Gross Domestic Product (G)
depend on its population (P) and on its average level of education (E,
measured as the number of school years for the median person)?
Suppose the following model is proposed, as a first approximation,
when comparing countries: G=aPE^(1/2).
a) Does the proposed relationship between G and P make sense?
Why or why not? (Consider what would happen to G when a country
splits into two equal parts.)
b) When the median person has at least 6 years of schooling, does
the proposed relationship between G and E make sense? Why or why
not? (Consider what would happen to G when E is doubled.)
c) Also consider what the model leads to in a country with absolutely
no schooling. Suggest a slight correction to the model, to take care of
this difficulty.
10. (14 points) The clipping from Los Angeles Times (25 January 2013)
shows the results of recent elections in Israel. [This graph, NOT
REPRODUCED HERE, shows these numbers of seats for parties: 20,
12, 11, 11, 7, 19, 15, 6, 6, 4, 4, 3, 2; Conservative parties, 61; Center-Left parties, 59.] Israel is one of the few countries that allocate seats by
nationwide proportional representation (with a minimal “legal threshold
of votes”). Compare this picture with what you would expect when
knowing only that there are 120 seats, allocated by nationwide proportional representation.
If this were a final exam, these would be all the instructions given. But this being a
midterm, further hints follow.
What number of parties would you expect to win seats?
To what extent does the actual number differ?
How large would you expect the largest party to be?
To what extent does the actual size differ?
How large would you expect the largest party to be, if knowing the
actual number of seat-winning parties?
To what extent does the actual size differ?
11. (14 points) Suppose male and female literacies varied as follows,
over time (they roughly do):
Time                  t1    t2    t3    t4    t5
Male literacy (%)     16    30    50    81    92
Female literacy (%)    1     5   ---    59    81
Determine the female literacy at time t3, if the general pattern holds.
Make use of all you have learned in this course.
If this were a final exam, these would be all the instructions given. But this being a
midterm, a further hint follows: Think inside the box.
259
APPENDIX B
12. (16 points) Consider the equation y=M/[1+e^(-kt)], where t is time, and
M and k are positive constants.
a) How large would y be at time t=0?
b) How large was y in the far past?
c) How large will y be in the far future?
d) Sketch a graph y vs. time.
13. (14 points) All developed democracies tend to redistribute some
wealth from wealthier to poorer people, through taxation. Some people
like it, some don’t. One may well presume that the extent of redistribution is higher when more people support it. The figure below shows
the extent of redistribution graphed (y) against the support for redistribution (x). Both can in principle range from 0 to 100 %. (Source: Lupu and
Pontusson, Am. Political Science Review 105, p. 329, 2011.)
Does the line shown in the graph make sense at all possible values of
support for redistribution? Offer a better line or curve. [Neglect the
SWZ and SPA points.]
If this were a final exam, these would be all the instructions given. But this being a
midterm, further hints follow.
If no one supported redistribution, how much redistribution could we
logically expect?
What does the straight line shown in the graph represent?
What would it predict when support for redistribution is nil?
Which line or curve would satisfy both data and logical considerations?
How does its degree of fit compare to the fit of the line shown?
14. (16 points) In an assembly of S=100 seats with p=10 parties
represented, we usually start an estimate of the largest party share (S1)
by observing that it must be between the mean share and the total:
10≤S1≤100. A perceptive student protests that S1=100 is conceptually
impossible, because it would not leave any seats for the other 9 parties
which are also supposed to be represented. The instructor agrees that
this is only a first approximation, which could be refined later on.
a) What value of S1 would result from this first approximation?
b) How would you refine the first-approximation approach, to satisfy
this student?
c) What value of S1 would result then?
d) The actual largest shares in the Netherlands 1918-52, when S=100
and p was close to 10, ranged from 28 to 32, with a mean of 30.5.
Compare this data piece with the predictions of the coarse and more
refined models. Does the refining of the model seem worthwhile?
15. (20 points) What is the simplest model to try, if you know that the
relationship between y and x is subject to the following constraints?
Give the name of this type of model, and also its equation, if you can
(for some, we have not given it). Sketching a picture might well help to
guide you. Three separate cases:
a) When x=0, y must be 1. When x becomes huge, y approaches 0
b) When x=0, y must be 0. When x=1, y must be 1. In-between, x and
y are not equal.
c) When x=0, y must be 0. When x becomes huge, y approaches
100%.
d) When x=0, y must be 1. When x=1, y must be 0. In between, y and
x need not be equal. (This is a special challenge – a case we have not
discussed explicitly. Bypass this question, if you find the previous ones
difficult.)
16. (14 points) Indicate the simplest model to try, when the relationship
between y and x is subject to the following constraints. In each case,
sketch a picture, give the name of the model, and give its equation.
a) When x=0, y must be 0. When x=1, y must be 1. When x=0.5, y
must be 0.5. In between, y and x are not equal. Give a specific
example where these constraints apply.
b) When x tends toward negative infinity, y tends toward 0.
When x tends toward positive infinity, y tends toward
a maximum value M. Give an example where these constraints
apply.
17. (27 points) In the graph below, add the following, trying to be
precise. [GRAPH NOT REPRODUCED HERE shows points 3,5; 9,5;
9,8.]
a) Draw in the OLS regression line y vs. x. Label it. (Wiggly lines
are marked down.)
b) Draw in the OLS regression line x vs. y. Label it.
c) Determine the coordinates (X,Y) of the center of gravity of
these 3 points. (This is where precise drawing is needed.)
d) Calculate the slope of the symmetric regression line.
e) Draw in the symmetric regression line quite precisely. Label it.
f) Which data fit should you choose, in the absence of any further
information on the causal relationship between y and x?
g) In the absence of any further information, what would be your
best guess for y at x=0 and for y at x=10?
h) Now suppose you also know that x determines y, rather than
vice versa. Which data fit should you choose now?
i) Now suppose you also know on logical grounds that y=0 goes
with x=0. Add this to the graph and look at it. Which pattern
makes sense now? Draw in the approximate new fit.
18. (20 points) The following data are given, with no further information on logical constraints etc.:
x     0     5     10
y     1    20    400
a) Calculate the suitable means for x and y (i.e., those most likely
to reflect the medians).
b) Graph the data points, y vs. x, on the type of graph paper most
likely to yield a straight line. A variety of graph papers is attached for this
purpose.
c) On the same graph, also place the point for (mean x, mean y).
d) Draw in the best-fit straight line. (Wiggly lines are marked
down.)
e) Write the equation that connects y and x (to the extent the line
in graph holds). If you can give the numerical values of some of
the constants in the equation, so much the better, but symbolic
notation will do.
19. (21 points) Describe as fully as possible what one can directly
see in this graph.
20. (24 points) Costa Rica's population was 0.3 million in 1900; 0.9
million in 1950; and 4.2 million in 2000. Do all you can with this
information, using skills developed in this course.
21. (40 points) The percentage of people who disagree with the statement “Most people can be trusted” is as follows, depending on their
level of education, Lower or Higher. (The responses of people with
Middle education tend to be in between.) Analyze these data, using all
your skills, and draw conclusions. Data reprocessed from Table A165 in Ronald
Inglehart et al. (2004), Human Beliefs and Values (Mexico: Siglo Veintiuno). The sample is fairly
representative.
Country      Lower   Higher
Argentina      86      79
Australia      69      50
Brazil         98      95
China          50      26
Denmark        44      17
U.S.           76      58
22. (20 points) Literacy is expressed as the percentage of people over 15
who can read and write. As a population becomes more literate, female
literacy (F) tends to remain lower than male literacy (M): F<M. When
M increases, F increases too: M up → F up.
a) When female literacy is graphed against male literacy, which
regions are conceptually allowed, and which are forbidden? (Draw a
graph, to show them.)
b) Suppose the following model is proposed: F=M-20. Show it on
your graph. (Do so fairly precisely, e.g., by calculating F for M=100
and M=50 and drawing a line through these points.)
c) This F=M-20 does satisfy the two conditions above (F<M, and M
up → F up). Still, why is it not acceptable, on logical grounds?
d) Suggest conceptual anchor points, which all logical models to
connect F and M must respect; show them on your graph.
e) Show a better model on your graph, with an equation to go with it.
[No need to calculate the numerical value(s) of constant(s)]
22A. (30 points) Literacy is expressed as the percentage of people over
15 who can read and write. As a population becomes more literate, the
following is most often observed. 1) Female literacy (F) tends to remain
lower than male literacy (M): F<M. 2) When M increases, F increases
too: M up → F up. Suppose we have the following data (in %), at
various times:
Time    t1    t2    t3
M       25    30    70
F        6     9    49
Suppose someone has proposed the following model, to connect F and
M: F=M-20.
When faced with such a situation, the usual questions a researcher poses
to her/himself are:
Should I accept this proposal? If so, why? Or should I reject it? If so,
why?
Is this model the best one can do, or can I offer something better?
Do all you can with this information, using skills developed in this
course. Beyond responding to step-by-step commands, show that you
can set up such steps for yourself.
23. (27 points) Literacy percentages for females and males in three
developing countries are F=35, M=45; F=45, M=60; F=60, M=70.
Build a model, in as much detail as you can.
For Appendix A
24. (13 points) An output z is estimated from linear regression in x and
y: z ≈ a+bx+cy. The mean values of the inputs are X and Y. The
impact of y is found to be statistically non-significant, so one wishes to
forget about y and estimate z from x alone: z ≈ A+Bx. How are the new
constants A and B derived from the previous?
25. (13 points) An output z is estimated from linear regression in x and y:
z ≈ 0.68+0.56x+0.28y. The mean values of the inputs are X=2.1 and
Y=1.6. The impact of y is found to be statistically non-significant, so
one wishes to forget about y and estimate z from x alone: z ≈ A+Bx.
What values should we plug in for A and B?
26. (16 points) Suppose you are given the following results of a multivariable linear regression of z on x and y:
x            0.97***
y            0.55
Intercept    0.20
R^2          0.50
a) Write the equation by which we can estimate z from x and y.
b) Suppose the median value of x is 10.0 and the median value of y
is 0.90. Calculate an estimate for median z.
c) One of the inputs is statistically not significant. Omit it, and show
the equation that connects z to the remaining input.
d) Use it to estimate the median z. Does the result surprise you?
Why, or why not?
27. (20 points) Continuation of Question 9, which is reproduced here.
A country's Gross Domestic Product (G) can be expected to grow with
its population (P) and also with its average level of education (E),
measured as the number of school years for the median person. Suppose
the following model is proposed, for a not heavily overpopulated
country: G=aPE^(1/2)
a) Does the proposed relationship between G and P make sense?
Why or why not? (Consider what would happen to G when P is
doubled.)
b) When the median person has at least 6 years of schooling, does
the proposed relationship between G and E make sense? Why or why
not? (Consider what would happen to G when E is doubled, and then
doubled again several times.)
c) Also consider what would happen in a country with absolutely no
schooling. Could you suggest some slight correction?
d) Rather than using this attempt at a logical model, suppose
someone decides to use multivariable linear regression to elucidate the
connection of G to P and E. How should he first transform the
variables? (Write out the equation to be used.)
e) Assume that the outcome of such an analysis is as follows (x
pertaining to P and y to E):
x            0.97***
y            0.55
Intercept    0.20
R^2          0.50
Write the corresponding equation for G itself.
f) The resulting equation has some similarities to G=aPE^(1/2), as well
as differences. To what extent would it confirm or disconfirm the supposition G=aPE^(1/2)?
APPENDIX C
An Alternative Introduction to Logical Models: Basic “Graphacy”
in Social Science Model Building
What is graphacy?
In analogy with literacy (ability to read and write), the notion of
numeracy has been introduced for ability to handle numbers. I add
“graphacy”, for ability to construct and interpret graphs. Here I develop
graphacy skills for building logical models, using social science
examples.
I try to make do without any mathematics, or at least without any
equations. This is the challenge. Bright 13-year-olds should be able to
understand the first two chapters. All I expect is 1) knowledge of what
percentages are, and 2) ability to graph a set of data, y vs. x, on regular
scales. My experience is that not all college students possess the latter
skill, so my Logical Model Building and Basic Numeracy in Social
Sciences (Taagepera 2015) spends a long introductory chapter on its
pitfalls. Here I bypass it.
Forget about bar and pie charts. They do not connect. Science is
about ability to connect factors, and then to predict. So I focus on
graphing one factor (y) against another factor (x). My goal is to show
how to use such graphs, adding lots of good sense, so as to build
“quantitatively predictive” logical models.
Why do I see a need for the present text? Students who complain
that they are “not good at math” too often fail even before reaching any
math in the sense of arithmetic or algebraic equations. They fail at
visualization of a problem or issue. In final exams based on Taagepera
(2015), too many students who have mastered a fair amount of
mathematics make disastrous mistakes in non-math picturing of logical
constraints, mainly by not trying to picture them at all. This initial
failure can make what follows pointless, no matter how many
logarithms and exponents one piles up. Visualization seems to be a
major hurdle. So, such graphacy skills should be introduced at the very
beginning of a course on logical models, so as to give these skills time
to be digested and reinforced throughout the remaining course.
The first chapter introduces our amazing ability to deduce the broad
overall pattern of y vs. x at any values, on the basis of just logical
constraints involved plus one data point. The second chapter expands to
a number of the most usual constraints. Here data points are fictitious (but
realistically chosen). The third chapter presents actual data, as published
but only incompletely analyzed (mostly by a linear fit that risks producing
absurd values of y at some values of x). This chapter asks students to
establish the constraints involved and then draw the curve for the entire
possible range of x. A separate document offers my answers.
How do these graphacy chapters fit in with Logical Models and Basic
Numeracy in Social Sciences
(http://www.psych.ut.ee/stk/Beginners_Logical_Models.pdf)?
In pilot courses, 2016 and 2017, I intend to start off with Chapter 2 of
Logical Model Building and Basic Numeracy in Social Sciences, which
introduces ability to graph a set of data, y vs. x, on regular scales
(including the first Home Problem). I then intend to shift to the present
text. As we work out its Chapters 1 and 2 in class, Chapter 3 is assigned
as Home Problems, probably in two separate batches, Cases 1 to 5 and 6
to 11. Out of these 11, fully 8 occur in Logical Models and Basic
Numeracy – but here they come unencumbered with any equations.
Thus the basics of graphacy should be easier to grasp, and later we can
address the mathematical aspects separately.
A complementary skill, the ability to “read” all features in a moderately
complex graph, is not addressed here but rather in Logical Models and
Basic Numeracy (Chapter 10, section “What can we see in this
graph?”).
Once the pilot courses work out, I intend to include this text in
Logical Models and Basic Numeracy. However, this would mean a
major rewrite.
Rein Taagepera
University of California, Irvine and University of Tartu, Estonia
Johan Skytte Award 2008, IPSA Karl Deutsch Award 2016
1. Data: Do something with them, using thinking
Suppose I offer you the following data point: when x=60%, y=36%. Say
something about this information. Do something with it.
Are you at a loss? What can one do with just one data point? If your
idea of science is limited (mistakenly) to computerized statistical
analysis, you can do little, indeed. But if we realize that science is an
interaction of data and thinking, then we can squeeze something out
of even this little.
1. One data point and abstract percentages
What can we do? Compare. Place this 60%;36% in a wider framework.
(Without comparisons, science could not even begin.) Still at a loss?
Think in a comparative way. What is usually the lowest value percentages could have? And the highest? These are useful comparison
points. Still at a loss?
OK, here’s what I am after. Percentages usually run from 0 to 100.
In this range, 60 is on the high side, closer to 100, while 36 is on the
low side, closer to 0. This is what I mean by “Compare and place in a
wider framework”. If you think this is too obvious to be mentioned, you
are on the wrong track. Science starts with making conscious note of the
“obvious”.
There is still more we can do. Again at a loss?
Always try to turn a problem into a picture. Still at a loss?
Well, we have some x and some y. Let us graph it. Graph all we know.
Maybe you drew the axes only from 0 to about 60 and then placed
the point 60;36 in this field:
If this is what you did, then you did not graph all we know. We also
know that the percentages usually go up to 100, and no more. So draw
the entire box, x going from 0 to 100, and y going from 0 to 100. Shade
in grey the area outside this box: This is the region where percentages
usually cannot occur. In contrast, inside this box, any point x;y may
occur, in principle.
We now can see the “conceptually allowed” region, and within it, our
single data point. It is now quite explicit where this data point is located,
compared to all the other locations where it could conceivably be.
If you now think I have dumbed it down beyond your dignity and
patience, with stuff at best marginal to science, please reconsider. If all
this is so self-evident that you might as well by-pass it, then why didn’t
you draw the box above on your own, the moment you were told, “Do
something with it”?
As for being marginal to scientific thinking, this is dead wrong. This
example is at the very center of scientific thinking, because it introduces
two basic habits to develop:
 Think about combining things directly given (data
point) with other knowledge you already have (percentages ranging from 0 to 100).
 Visualize – draw a picture – a picture of the entire
possible range.
2. One data point with substantive content
This is how far we can go without knowing what x and y are about –
just abstract x and y. Now let us give them substance. This is where we
shift from mathematics to natural science, human nature included. I now
tell you we are dealing with male and female literacy rates in a country –
the proportions able to read and write. Make it appear in the notation:
M=60%, while F=36%. Such informative notation greatly helps
thinking, compared to bland x and y. Now we can go much further with
these data plus thinking.
The fundamental question that we should ask at every turn is:
 But can this be so? Does it make sense?
Apply it to M=60%, F=36%. Are you at a loss? Compare.
M is larger than F. Does this make sense? If we were given the
reverse (M=36%, F=60%.), would you sense something odd? You
should. When literacy is low, it always is even lower for females than
for males. This is a hard fact. So M larger than F makes historical sense.
Historical. Think about this word. What could the literacy rates have
been earlier, and what could they be later on? Try to offer some estimates,
and place them on the graph F vs. M. How on earth could we know?
For recent past and close future, this would be guesswork, indeed, at
this stage. But, counter-intuitively, this task becomes quite precise for
much more distant past and future. How could this be?
Way back, there must have been a time when no male in this country
could read: M=0%. What could F have been at that time? No escape: It
must have been F=0%. When was that? Never mind the date. (In
science we must learn to overlook what is not needed.) But whenever it
was that M was 0, F must have been 0 too (unless one builds a very
fancy story). Place this point on the graph.
Now think ahead. Sooner or later, all countries reach practically
complete literacy: M=100% and F=100%. Place this point, too, on the
graph:
Such points are “conceptual anchor points”, because they anchor the
entire possible path literacy can take, from no one literate to everyone
literate. Anchor points are “experimental data points” in a special sense:
They result from thought experiments. Hold it, some data-oriented
people may scream: Science deals with reality, not hypotheticals. They
are dead wrong. Thought experiments are very much part of science.
Our anchor points are even stronger data points than 60;36, in the
following sense. The 60;36 is subject to random error. It could have
been 61;35, for all we know. But 0;0 has no such leeway. This is a
perfect 0;0. (The 100;100 could technically stop at something like
99.7;99.8. Learn when it makes sense to overlook such quibbles.)
Now our graph no longer has just a single data point. It has 3 points.
“In the absence of any further knowledge” (another important term in
science thinking), our best bet would be to join them with a smooth
curve. (If we make it fancier, we must offer reasons why the path must
deviate from the simplest path possible.)
Chances are that the curve you have drawn suggests that M=50% would
correspond to F around 25%. In reverse, F=50% would correspond to M
around 70%. Does it, approximately?
In sum, for any given male literacy rate we can predict what the
female rate would be. And vice versa. This is a main goal of science:
ability to predict, with some degree of precision.
We could further add the “equality line”, F=M. The distance
between this diagonal and the actual curve shows how much female
literacy trails the male. As overall literacy increases, this gap (M-F)
widens until M reaches 50%. Thereafter, the gap begins to decrease.
Most countries actually do follow such a pattern, as literacy develops.
We could now go mathematical and develop the equation that pretty
much fits the curve you have drawn, but we will not do so at this stage.
(This equation is F/100=(M/100)^2, if you insist on knowing.) The main
point is that a single data point could lead to this ability to predict
female literacy rate from the male, and vice versa. This is pretty powerful! Such predictive ability becomes possible, once one trains oneself to
practice thinking, using fairly broad guidelines. To wit:
 “But can this be so?” Does it make sense?
 Use informative notation (M and F rather than x and y).
 Think about extremes (like 0 and 100%).
 Compare (also with extremes).
 Graph the data.
 Graph more than the data -- show the entire conceptually allowed area where data points could possibly occur.
 Consider conceptual anchor points.
 Connect anchor points and data points with a smooth curve.
 Sometimes a diagonal also makes sense, as a benchmark for comparisons.
3. More data points: thinking vs. statistical analysis
Now assume more data points are given -- like 60;36, and 50;25, and
70;50. For the thinking approach, this adds confirmation that the single
point 60;36 was not a fluke or result of erroneous measurement. All
three data points visibly are close to the same smooth curve joining the
anchor points. This is a pattern that makes sense, going from utter
illiteracy to complete literacy. For instance, when M=20%, F is around
4% (the small “+” in graph below).
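If you like to verify such numbers with a few lines of computer code, here is a minimal check of the curve F/100=(M/100)^2 in Python (purely illustrative; the function name is my own):

# Check the curve F/100 = (M/100)**2 against the data points in the text.
def female_literacy(m_percent):
    # Female literacy predicted from male literacy, both in percent.
    return 100 * (m_percent / 100) ** 2

for m, f_observed in [(60, 36), (50, 25), (70, 50)]:
    print(m, f_observed, round(female_literacy(m), 1))
print(female_literacy(20))   # 4.0, the small "+" at M=20% mentioned in the text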
But now we also have enough points to apply statistical analysis. This
leads to temptation to spare us thinking. Let the computer do this dirty
work of thinking! (This means we stop being scientists but still hope to
get results that qualify as science.) Enter the data in the computer, push
the “linear regression” button, and we get a printout with lots of
impressive-looking numbers.
We might figure out that some numbers in the printout mean that the
best straight-line fit is F=-41+1.3M. But we also are offered much more.
The printout also shows “R^2”. Its value is around 0.90, which means that
the line obtained is a pretty good fit -- data points are scattered only
slightly around this line. The printout may also have p-values, stars next
to some coefficients, and maybe even more.
This may look impressive. But what about ability to predict, this
major goal of science? From your computer printout, can you determine, say, what the female literacy rate would be when the male one is
20%? Or what the male literacy rate would be when the female reaches
100%? If you cannot do that, you can publish your printout and babble
about R-s and stars, but your publication would fall short of what is of
most interest in science.
Suppose you can hack it through the printout maze and deduce from
it the line F=-41+1.3M. This does enable us to deduce F from M, and
vice versa. What would female literacy rate be when the male one is
20%? For M=20%, the computer equation predicts F=-15%. A negative
female literacy rate? Does that make sense? Of course not. Something is
badly off the track.
What would male literacy rate be when the female reaches 100%?
The computer equation predicts M=108.4%. More than 100% of the
males being literate? Come now!
What’s happening? To make it visible, let us graph the computer-supplied results and compare the outcome with our previous one:
Yes, this is what one gets, when one does not think and does not graph,
before rushing to the computer. Good data in, junk out. The computer
ignored the conceptual anchor points. Why? Because no one told the
computer that such anchor points exist. Do not blindly yield to the
computer printout. Ask “Does it make sense?” If it doesn’t, do not
accept it. Say: I must have goofed somehow.
Indeed, nothing is wrong with the computer and its linear regression
program. Blame the person who misuses them. Before feeding the data
into the computer, we must think about how to do it so that it makes
sense. This takes some mathematics, which we avoid here. (If you insist
on knowing: we must feed into linear regression the logarithms of M
and F, not M and F themselves.) For the moment, let us get more
practice with what is the most important – drawing pictures to express
various situations. And this does not need mathematics! – at least not
what is usually viewed as mathematics.
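If you insist on seeing how feeding in the logarithms works out, here is a minimal sketch in Python, using just the three data points above. The straight-line numbers differ slightly from the F=-41+1.3M quoted earlier, which presumably came from a fuller data set; treat everything here as illustration only.

import numpy as np

M = np.array([60.0, 50.0, 70.0])   # male literacy, percent
F = np.array([36.0, 25.0, 50.0])   # female literacy, percent

# Straight line fitted to the raw data: fine inside the data range,
# absurd outside it (negative F at low M, more than 100% M at F=100).
b, a = np.polyfit(M, F, 1)
print("linear fit: F =", round(a, 1), "+", round(b, 2), "* M")
print("predicted F at M=20:", round(a + b * 20, 1))        # comes out negative

# Fit the anchor-respecting form F/100 = (M/100)**k instead:
# log(F/100) = k*log(M/100), a one-parameter fit through the origin.
x = np.log(M / 100)
y = np.log(F / 100)
k = np.sum(x * y) / np.sum(x * x)
print("exponent k:", round(k, 2))                           # close to 2
print("predicted F at M=20:", round(100 * 0.20 ** k, 1))    # close to 4%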
2. Some usual patterns of data subject to constraints
Two factors can interact in a multitude of ways. The processes of interaction can be varied and complex. At times x may influence y, while at
other times y may influence x. Or both x and y may be influenced by a
hidden third factor z. Constraints such as anchor points may still impose
similar patterns. We’ll cover a number of them.
Do not try to memorize them. When approaching a new problem, do
not try to go through the list, trying to guess which one might fit. None
might. Try graphing the constraints the factors are subject to, and only
one option may emerge.
1. Box with two anchor points
The pattern of male and female literacy rates repeats itself whenever
two factors must be 0 jointly and also must reach their maximums
jointly. One can always express their values as percentages of their
respective maximums. Then 0;0 and 100;100 are anchor points. Thus,
similar curves arise when a few data points are given. (Somewhat
different shapes can still materialize, but let us not worry about this
right now.)
Take popular approval ratings of head of state and legislative
assembly. Typically, people rate heads of state higher than assemblies.
(They yearn for father figures.) Then, when even the head of state’s
approval rate drops to 0%, we can expect no higher rate for the
assembly. This leads to anchor point 0;0. Conversely, if even the
assembly rates 100%, we can expect no lower for the head of state. This
leads to anchor point 100;100. If we have a single data point for a
country, say head of state approval H=60%, assembly approval A=36%,
we can already guess at how one of them might change when the other
changes.
Now suppose we have data for 3 countries across the world, one data
point for each. I use three different symbols, so as to drive home that these
are different countries:
What could be our guess for the overall pattern for the world? Some
students have too much respect for data and try to pass a wiggly curve
through all the points, sometimes even neglecting to connect to anchor
point 100;100:
Don’t do it! Different countries may be on different curves (each of
them respecting the anchor points), some closer to the diagonal, A=H,
some further away:
(Mathematically, if you insist on knowing, all these curves are
A/100=(H/100)^k, but the value of k is different.) Our best guess is that
the world average curve might be around the central curve. With so few
cases, we cannot be sure. But with the help of logical anchor points, we
can do much better than say “I don’t know”.
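If you insist on the arithmetic: a single data point pins down the exponent k of its curve, since k = log(A/100)/log(H/100). A minimal sketch in Python, assuming the H=60%, A=36% data point used earlier:

import math

# One data point (H=60%, A=36%) picks out one member of the family
# A/100 = (H/100)**k: solve for k by taking logarithms.
H, A = 60.0, 36.0
k = math.log(A / 100) / math.log(H / 100)
print(round(k, 2))   # 2.0 for this particular data point

# With k in hand, assembly approval can be predicted at any head-of-state approval.
for h in (20, 50, 80):
    print(h, round(100 * (h / 100) ** k, 1))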
Here’s another example. Trust in people varies enormously from
country to country. In some, only 10% say people usually can be
trusted; in some others 60% do. But within the same country, the
average trust level of people with higher education (H) always is higher
than the average trust level of those with primary education (P). Anchor
points 0;0 and 100;100 still apply (if trust is in percent). When we graph
P vs. H for all countries in the world (not shown here), we get a single
curve with remarkably little scatter.
Further examples could be piled up, but let us stop before you start
thinking that a box with two anchor points is all the wisdom you need.
2. Box with three anchor points
Have a country with only two parties, where assembly seats are allocated “by plurality” in single seat districts. This process works out in
such a way that the party that gets a larger share of votes (V) wins an
even larger share of seats (S). Typically, when a party has 60% of the
votes, it receives 77% of the seats.
If this party has 100% votes, it obviously wins 100% seats. And if it
has 0 votes, it also has 0 seats. This looks like another situation with
two anchor points. But hold it. What if the votes are 50–50? In the
absence of any other information (the famous expression), we cannot
favor either party. So the seats must also go 50-50. We have a third
anchor point in the center. Along with our single data point, 60;77, the
picture is
How should we draw a curve to join the 3 anchor points and the data
point? Actually, we have a hidden second data point. How come? Recall
that there are only two parties. Think: What does it imply for the second
one, when the first one is 60;77? Enter this extra point to the graph. Join
these 5 points with a smooth curve:
We have little leeway. The simplest curve we can draw through these
points is bound to be a bit more complex than in earlier cases. This is very
much the pattern the UK used to have when it had a pretty clear two-party
constellation. It no longer has. (The equation for this type of curve is
S/(100-S)=[V/(100-V)]^k. Here k=3. This is the slope at V=50.)
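A quick numerical check, in Python, that k=3 indeed reproduces the 60;77 data point, the central anchor point, and the hidden second data point (illustration only):

# Seat share from vote share, for S/(100-S) = [V/(100-V)]**k with k=3.
def seat_share(v_percent, k=3):
    r = (v_percent / (100 - v_percent)) ** k
    return 100 * r / (1 + r)

print(round(seat_share(60), 1))        # about 77, the data point in the text
print(round(seat_share(50), 1))        # 50, the central anchor point
print(round(seat_share(40), 1))        # about 23, the hidden second data point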
Now suppose we have a direct presidential election. Whichever
candidate has even a slight edge wins the entire single seat at stake.
Here 51% votes leads to 100% seats. How does the graph look now?
How can we join these points? This is still the same picture of 3 anchor
points, except that the smooth bend must become a sharp kink. What
would happen if both candidates had exactly the same number of votes?
One would have to throw a coin, and each candidate would have a 50–
50 chance. So the third anchor point at 50–50 still applies. (The
equation for this curve still is S/(100-S)=[V/(100-V)]^k, but here k tends
toward infinity.)
In the other direction, if we had a huge number of seats, the curve
would become flatter. It would approach the diagonal, S=V. (For this
curve k=1, in S/(100-S)=[V/(100-V)]^k.) So the number of seats at stake
matters. Within the same “family of curves”, a larger total number of
seats makes the curve flatter, while a smaller number makes it kinkier.
4. Between floor and ceiling
When the economy expands (per capita Gross Domestic Product goes up),
more people tend to be satisfied with the situation. When the economy
contracts, more people tend to be unhappy. Suppose that 40% of people
are happy when the economy is stable (per capita GDP goes neither up nor
down). Suppose that 60% of people are happy when the economy expands
by 4% per year. Make a picture of Satisfaction (S) vs. Growth in GDP
(G). Include the extreme possibilities. We first get this:
We have just two data points, sandwiched between a floor of 0% satisfaction and a ceiling of 100% satisfaction. How are we to draw a curve
that gives us a value of S for every value of G?
The allowed region is a zone between S=0 and S=100%, at any value
of G. When growth is very positive, satisfaction is likely to rise toward
100%, but without ever fully reaching it. When growth is very negative
(economy shrinks sharply), satisfaction is likely to drop toward 0%, but
again without ever fully reaching it. Symbolize this approach to a limit
as slightly bent curves very close to floor at left and to ceiling at right.
They occur at very large increase or decrease in GDP:
Now it becomes a matter of joining the two data points in a way that
respects the conceptual limits. Try to make it as smooth and symmetrical as possible:
This curve is often characterized as a “drawn-out S”. (The technical
term for the simplest drawn-out S curve is “simple logistic”. Its equation is S=100/(1+e^(-k(G-G'))), where G' is the value of G where S is 50. The
k reflects the steepness where S=50.) Note that our previous 3-anchor-point
curve could also be called drawn-out S. But there the limits on horizontal axis are 0 and 100, while here yearly change in GDP could in
principle range from hugely negative to hugely positive. (Technically, it
could range from minus 100 to plus infinity.)
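For the curious: if the two data points are taken as exact (S=40% at G=0 and S=60% at G=4), the two constants of the simple logistic can be solved by hand, giving G'=2 and k=ln(1.5)/2, about 0.20. A minimal check in Python (illustration only; G' is written G0 in the code):

import math

# Simple logistic S = 100 / (1 + exp(-k*(G - G0))), pinned down by the two
# data points in the text: S=40% at G=0 and S=60% at G=4.
G0 = 2.0                   # growth rate at which satisfaction is 50%
k = math.log(1.5) / 2      # about 0.20; sets the steepness near the middle

def satisfaction(g):
    return 100 / (1 + math.exp(-k * (g - G0)))

for g in (-10, 0, 2, 4, 10):
    print(g, round(satisfaction(g), 1))   # approaches 0 and 100 at the extremes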
Anchor points and approaches to limits are the two major forms of
constraints we encounter. Anchor points are easier to visualize: a specific value of y at a specific value of x. Approaches to limits are more
awkward: a specific value of y when x “tends toward” minus or plus
infinity. Infinity is awkward. Try to get used to it.
With two anchor points, a single data point could give us some idea
about the shape of the curve. But with two approaches to limits, as is the
case here, we need at least two data points. Otherwise the steepness of
the curve near the center could not be estimated. To have a fair estimate,
these two data points should be far from the limits and far from each
other. Our two data points, at S=40% and S=60%, satisfy this.
The drawn-out-S curves are most usual when sizes change over time.
Time ranges from minus to plus infinity. Take a bacteria colony that
starts with a single bacterium and eventually fills up the entire space
available. Its size increases along the drawn-out-S curve. Or take a
small human group landing on a previously uninhabited island and
multiplying, until arable land runs out and sets a limit to population. Or
take an idea or a product that initially meets resistance but then “takes
off”, until meeting some sort of a limit. These curves are not always
simple and symmetric, but the broad common feature is slow takeoff
from a floor and, later, a slow approach to a ceiling.
5. From anchor point to ceiling
Volatility of voters means their readiness to switch parties from one
election to the next. The more parties there are, the more there are ways
to switch. So volatility (V) should increase with increasing number of
parties (N). Suppose volatility is 24% when 3 roughly equal parties are
running. What could it be when only 2 roughly equal parties are
running? If you guessed at 2/3 of 24%, i.e., 16%, you overestimated a bit.
Better ask again: What are the limits on N and V? And are there any
anchor points or approaches to a limit?
The allowed region is a box open to the right. The number of parties
must be at least 1, but could be huge, in principle. Volatility can range
from 0 to 100%.
If only the same one party runs in both elections, then party
switching is impossible. Thus, we have an anchor point: N=1 imposes
V=0. When a huge number of parties run, volatility might approach
100% but not exceed it. In the absence of any other knowledge, take
V=100% as the ceiling, approached as N tends toward infinity. The
picture now is:
Smoothly joining the anchor point, data point, and the approach to the
limit gives us
So, with two parties, a volatility of about 12% could be expected. Note
that in this case we again could make an estimate on the basis of a
single data point. (The equation for this type of curve is V=100(1-e^(-k(N-1))).
Here k=0.14.)
You may be unhappy about many details. How can we draw a
continuous curve, when parties come in integer numbers? We actually
use an “effective number” that weights parties according to their size. If
the vote shares are 60, 30 and 10%, this effective number is 2.17 parties.
(The formula is N=10,000/SUM(vi^2), with the vote shares vi in percent.)
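Both the effective number of parties and the volatility curve are easy to check with a few lines of Python (illustration only, using the numbers from the text):

import math

def effective_number(shares_percent):
    # N = 10,000 divided by the sum of squared vote shares (in percent).
    return 10_000 / sum(s * s for s in shares_percent)

print(round(effective_number([60, 30, 10]), 2))   # 2.17, as in the text

# Volatility curve V = 100*(1 - exp(-k*(N-1))), with k = 0.14.
def volatility(n, k=0.14):
    return 100 * (1 - math.exp(-k * (n - 1)))

print(round(volatility(3), 1))   # about 24, the Indian data point
print(round(volatility(2), 1))   # about 13, close to the 12% read off the curve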
We make many simplifying assumptions. The data point 3;24 is
typical of state elections in India, the best database available.
6. Constant doubling times
We have dealt with anchor points and approaches to limits. Can there be
utterly different kinds of constraints, to determine a pattern? Certainly.
A most usual one is constant doubling times. Young living organisms
have this urge to double in size at constant intervals. In the opposite
direction, a chunk of pure radioactive material loses one-half of its
weight at constant half times.
Have two data points for a sum deposited at fixed compound
interest: 1000 dollars right now (at time 0) and 2000 dollars 15 years
from now (t=15). (This implies a compound interest rate of 4.73% per
year.) What is the doubling time in this case? ... years, of course. With
the constraint of constant doubling times, this imposes 4000 dollars at
30 years from now – and in reverse, 500 dollars 15 years ago. And so on
for +45 and –30 years. The graph shows these deduced points as “+”:
Joining these points, we get a smooth curve that goes up ever faster in
the future and approaches 0 dollars at times far past. (Thus, the constraint of approaching a floor at 0 is built in when we assume constant
doubling times.)
Now the amount at any time can be read off from the curve. For
instance, 10 years from now, the amount would be around 1590 dollars.
Constant doubling times are the very definition of “exponential”
growth. (The equation for this pattern is S=S0e^(kt). For our example it is
S=1000e^(0.0473t).)
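A minimal numerical check of the doubling-time reasoning, in Python (illustration only; a doubling time of 15 years is the same thing as exponent k = ln2/15, about 0.046, which corresponds to the 4.73% yearly compound interest quoted above):

import math

# Constant doubling time T = 15 years: S(t) = 1000 * 2**(t/15),
# equivalently S(t) = 1000 * exp(k*t) with k = ln(2)/15.
def amount(t_years, start=1000.0, doubling_time=15.0):
    return start * 2 ** (t_years / doubling_time)

for t in (-30, -15, 0, 10, 15, 30, 45):
    print(t, round(amount(t)))          # 10 years from now gives about 1590
print(round(math.log(2) / 15, 4))       # 0.0462, the exponent k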
7. Overview of frequent combinations of constraints
The basic simple combinations are:
 two anchor points – at least one data point needed to place the
curve;
 one anchor point plus one approach to a limit – at least one data
point needed;
 approaches to two limits, upper and lower – at least two data
points needed to place the curve;
 constant doubling times – at least two data points needed to
place the curve.
Many other combinations can occur. The only extra one we dealt with
was 3 anchor points. Note that we found hardly any cases where a
straight line would do – just the diagonal of the square box, y=x.
Straight-line relationships are rare indeed.
3. Fitting actual data with logically grounded curves
Up to now, we have seen how little data we may need to make a pretty
good guess at what the overall pattern might be – once logical anchor
points and approaches to limits are taken into account. Actual data may
include a profusion of data points. They usually form a “data cloud”,
which may look straight or like a bent sausage or something more
outlandish. Some deviant points may occur far away from the dense
data cloud. Take the data presented at face value, without arguing about
the ways they were obtained. Our focus is on how to analyze valid data.
The game is still the same. Establish logical anchor points and approaches to limits. An equality line may offer perspective, if such an
equality line makes sense. (Percentages can be equal to other percentages, but areas cannot be equal to populations, for instance.) Published data graphs may involve frames and lines that may confuse rather
than help you.
Here are eleven challenges. Do not expect all of them to fit exactly
with patterns previously presented. We always have to think. My
suggested answers are to be found in a separate document, “Solutions
for problems in Chapter 3”. Even if they should be available, do not expect
any increase whatsoever in your science skills, if you just read the
questions and then peek at the answers. You have to sweat it through,
make mistakes, and learn from your mistakes. We cannot learn from
mistakes we were too lazy to make. If you get completely stuck with
some of the cases, go back to Chapter 2 and study it further. There is no
cheap way out, if you want to acquire science thinking.
CASE 1. How are ratings for capitalism and democracy related?
At least in the North Pacific, support for capitalism (call it C) and
support for democracy (D) seem to go somewhat hand-in-hand, but with
some scatter, as shown in graph below (Dalton and Shin 2006: 250).
What is the overall pattern? Copy the graph and insert additions needed.
Do not be confused by the box shown -- it has no meaning. Think of
allowed and forbidden regions, logical anchor points, approaches to
limits, and equality lines. Do not expect that all of these enter in every
instance.
In science, we need precision. Lines drawn freehand are not precise
enough. Use a straightedge. A book edge might do, but a transparent
straightedge makes life easier.
When fitting with a smooth curve, try to make it so that roughly half
the data points are above the curve and half are below. But keep the
curve smooth.
CASE 2. When you lose ground, do you lose everywhere?
The US Democratic Party did pretty well in 1996. In the next elections,
2000, they did worse. Did they do worse in every US state? The graph
below (Johnston, Hagen and Jamieson 2004: 50) shows the Democrat
vote shares in 2000 (call it S, for second election) plotted against the
Democrat vote shares in 1996 (call it F, for first election). The support
figures in individual states range from below 30 to close to 70%.
What is the overall pattern? Copy the graph and insert the additions
needed. Do not be confused by the box and lines shown – they have no
meaning or have a questionable meaning. Think of allowed and forbidden regions, logical anchor points, approaches to limits, and equality
lines. Do not expect that all of these enter in every instance.
When a scale must be extended, what is the simplest way to go?
Place the edge of a sheet of paper at the scale. Copy the scale on this
edge of the sheet. Then shift the sheet left or right and mark further scale
marks on the original graph.
CASE 3. How are support for gender equality and tolerance of
homosexuality related?
Average degree of support for gender equality (call it G) and tolerance
for homosexuality (call it H) seem to go somewhat hand-in-hand, as
shown in graph below (Norris and Inglehart 2003: 27). But scatter is
wide.
What is the overall pattern? Copy the graph and insert the additions
needed. Neglect the blatant deviants and focus on the dense data cloud.
Do not be confused by the box and line shown – they have no meaning
or have a questionable meaning. Think of allowed and forbidden
regions, logical anchor points, approaches to limits, and equality lines.
Do not expect that all of these enter in every instance. When fitting with
a smooth curve, try to make it so that roughly half the data points are
above the curve and half are below. But keep the curve smooth.
Note that Gender Equality scale runs from 0 to 100, but Approval of
Homosexuality scale runs from 0 to 10. Thus, “10” means full tolerance. To convert to percent, what do we have to do?
CASE 4. How number of parties affects voters’ volatility
The graph below shows that voters’ volatility in Indian states, from one
election to the next, tends to increase when they can choose from among
more parties (adjusted from Heath 2005). But scatter is wide.
What is the overall pattern? Copy the graph and insert the additions
needed. Think of allowed and forbidden regions, logical anchor points,
approaches to limits, and equality lines. Do not expect that all of these
enter in every instance. When fitting with a smooth curve, try to make it
so that roughly half the data points are above the curve and half are
below.
CASE 5. How does bicameralism connect with federalism?
Federalism means that a country is divided into “provinces” (Canada) or
“states” (US) with more or less autonomy. Strength of federalism (call it
F) expresses how autonomous they are. Bicameralism means that the
national assembly has two separate “chambers”, with different powers.
The first or only chamber is usually elected by “one person, one vote”.
The second chamber may represent the federal units, or ancient
aristocracy (UK), or whatever. Its strength (call it B) may range from
nothing to being equal to the strength of the first chamber.
Full-fledged federalism should go with two equally powerful chambers,
so that both population and federal subunits can be represented on an
equal basis. If a completely unitary country has a second chamber, it
might be weak. So there may be some connection between B and F.
The graph (Lijphart 1999: 214) shows limited connection.
This is a fuzzier case than the previous. Rather than measured
quantities, Lijphart uses here informed estimates. He does so on a 1 to 5
point scale for F and on a 1 to 4 point scale for B. For federalism, F=1
means a purely unitary country, and F=5 means that the provinces or
states are as strong as they ever get. For bicameralism, B=1 means only
one chamber, while B=4 means a second chamber as powerful as the
first chamber.
One confusing feature here is that this graph rates “nothing” as 1
rather than 0. Tough luck, but we sometimes encounter data in such a
form. We may wish to re-label the scales, so that 1 becomes 0% and
F=5 and B=4 become 100%. Then F=2 becomes F’=25%; F=3 becomes
F’=50%; and F=4 becomes F’=75%. And B=2 becomes B’=33.3, and
B=3 becomes B’=66.7. It would help to show this new scale in your
graph.
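The re-labeling is just a linear stretch: subtract the scale minimum, then divide by the range. A minimal sketch in Python (the function name is my own, for illustration):

# Stretch a scale that starts at 1 ("nothing") onto 0-100%.
def to_percent(value, scale_min=1, scale_max=5):
    return 100 * (value - scale_min) / (scale_max - scale_min)

print([to_percent(f, 1, 5) for f in (1, 2, 3, 4, 5)])          # 0, 25, 50, 75, 100
print([round(to_percent(b, 1, 4), 1) for b in (1, 2, 3, 4)])   # 0, 33.3, 66.7, 100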
What is the average overall pattern? Copy the graph and insert the
additions needed. Do not be confused by the box and line shown -- they
have no meaning or have a questionable meaning.
Hints: One might expect that strong federal subunits would go with a
strong second chamber to defend their interests. So F=5, B=4 looks like
a logical anchor point -- and quite a few countries inhabit this location.
The case for F=1, B=1 is flimsier, as nothing prevents unitary countries
from having second chambers. But quite a few countries inhabit this
location (F=1, B=1), so give it a try, as anchor point. What would be the
best simple curve joining 1;1 and 5;4? (Even the best one would still be
a poor fit for all too many countries. But science is about doing the most
about whatever we have, rather than giving up.)
CASE 6. How does strength of judicial review of laws relate to
constitutional rigidity?
Constitutional rigidity means that the constitution is hard to change.
Judicial review means that courts can pass judgment on whether a law
conforms to the constitution. Both show that the constitution is taken
seriously, so they might go hand-in-hand. But the graph below (Lijphart
1999: 229) shows only faint connection.
What is the average overall pattern? Copy the graph and insert the
additions needed. Do not be confused by the box and line shown -- they
have no meaning or have a questionable meaning.
We again face the difficulty that “nothing” is rated as 1 rather than
0. We may wish to re-label the scales, so that 1 becomes 0% and 4
becomes 100%. Then 2 becomes 33.3 and 3 becomes 66.7. Tough luck,
but we sometimes encounter data in such a form.
This is an even fuzzier case than the previous. Lijphart again uses
informed estimates. He does so on a 1 to 4 point scale for both factors.
For rigidity, C=1 means no written constitution at all (as in UK), and
C=4 means that the constitution is extremely hard to modify. For
judicial review, J=1 means no judicial review, and J=4 means a very
stringent review. One cannot declare a law unconstitutional when there
is no constitution! Thus C=1 should impose J=1. This looks like a
logical anchor point. But the graph shows that Iceland still manages to
have some judicial review of constitutionality of laws despite not
having a written constitution.
We also might set up as an ideal extreme that J=4 goes with C=4, even
while India falls short on C, and Switzerland falls utterly short on J.
What would be the best simple curve joining 1;1 and 4;4? (Even the
best one would still be a poor fit for all too many countries. But science
is about doing the most about whatever we have, rather than giving up.)
CASE 7. How do interest groups and political parties compensate for
each other?
Some countries have many separate interest groups, such as separate
trade unions and business associations. In some others they congregate
under a few roof organizations, such as a single nationwide association of
employers and a single trade union for workers. Lijphart (1999: 183)
makes use of a compound index of interest group pluralism (I). How
this index is formed need not concern us here. Just accept that I=0
stands for as few separate interest groups as there can be, while I=4
stands for as many separate groups as there can be.
Some countries have many political parties, while others have few.
The aforementioned effective number of parliamentary parties (Example 5, volatility) is a fair measure. One might think that more parties
would go with more interest groups. But the graph below (Lijphart
1999: 183) shows the reverse!
Yes, surprisingly, countries that have many parties tend to have low
interest group pluralism, and vice versa. It looks as if there were some
complementarity. But what is the form of this relationship?
Copy the graph and insert the additions needed. Do not be confused by
the box and line shown -- parts of them have no meaning or have a
questionable meaning. Think of allowed and forbidden regions, logical
anchor points, approaches to limits, and equality lines. Do not expect
that all of these enter in every instance. Here the fact that index I runs
from 0 to 4 rather than from 0 to 100% presents no special difficulty, as
long as we keep in mind that I=4 is an upper limit, by definition.
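As one hedged way to make “complementarity” concrete (not necessarily the pattern the graph supports): if parties and interest groups compensated for each other in the simplest multiplicative sense, the product of N and I would stay roughly constant. The readings below are invented, just to show how one could check this from one's own copy of the graph.

    # Minimal sketch, with invented readings of (N, I) from the graph:
    # strict multiplicative complementarity would keep N * I roughly constant.
    pairs = [(2.0, 3.5), (3.5, 2.0), (5.0, 1.4)]   # hypothetical values
    for N, I in pairs:
        print(N, I, round(N * I, 1))
    # If the products cluster around one value, a hyperbola I = constant/N
    # is a candidate simple curve; if they drift, some other format is needed.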
CASE 8. How does satisfaction with the economy change with
growth or decline in GDP?
When Gross Domestic Product increases, more people are likely to be
satisfied with the economy, and vice versa. The graph below (Lewis-Beck
et al., Electoral Studies 32: 524–528, 2013) confirms it. This graph
shows on the y-axis the percentage (W) of people polled who felt that
the economic situation had worsened versus, on the x-axis, the
percentage (G) by which GDP had increased or shrunk, compared to the
previous year. The data points are labeled with the years in which they
occurred; pay no attention to these dates but just to their locations on the
graph.
What is the average overall pattern? Copy the graph and insert the
additions needed. Do not be confused by the lines shown -- they may
have no meaning or have a questionable meaning. Think of allowed and
forbidden regions, logical anchor points, approaches to limits, and
equality lines. Do not expect that all of these enter in every instance.
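One bounded format worth keeping in mind, offered here only as an assumption-laden sketch, is a curve that can never leave the conceptually allowed band 0% to 100% for W, no matter how large or how negative G becomes; the parameter values below are invented.

    # Minimal sketch (illustrative only): a format that respects the limits
    # 0 <= W <= 100 for any GDP change G, and decreases as G grows.
    import math

    def W(G, a=0.0, b=0.5):        # a and b are made-up parameters
        return 100 / (1 + math.exp(a + b * G))

    for G in (-4, 0, 4):           # percent change in GDP
        print(G, round(W(G), 1))   # share feeling the economy worsened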
CASE 9. How do representatives reflect the ideology of their
constituents?
Assume one-seat electoral districts. One might well guess that the
representative of a very conservative district would vote accordingly in
the assembly, and vice versa for a very non-conservative district. (The
latter would be called leftist in most of the world but “liberal” in the
US). But what about a district where conservatives barely outweigh
non-conservatives? Would their conservative representative take into
account her non-conservative constituents and vote sort of 50-50 on
issues in the assembly? Or would she vote all-out conservative as if she
represented only those constituents who voted for her? The graph
(Dalton 2006: 231) shows a mixed picture.
What is the average overall pattern? Copy the graph and insert the
additions needed. Do not be confused by the lines shown – their
location has no meaning. Think of allowed and forbidden regions,
logical anchor points, approaches to limits, and equality lines. Do not
expect that all of these enter in every instance. Both district and
representative conservatism can in principle range from 0 to 1 (0 to
100%).
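For intuition only, the two extreme answers to the question above can be written down explicitly: a representative who mirrors her district exactly follows the equality line, while one who votes all-out with the winning side follows a step at 50%. Both stay inside the allowed 0-to-1 box and join the anchor points 0;0 and 1;1. The sketch below merely restates these two extremes; real data may fall anywhere between them.

    # Minimal sketch: two extreme patterns of representation, both inside
    # the 0-to-1 box and through the anchor points (0,0) and (1,1).
    def mirror(x):                  # representative mirrors district exactly
        return x

    def winner_take_all(x):         # representative votes with her own side
        return 0.0 if x < 0.5 else 1.0

    for x in (0.2, 0.45, 0.55, 0.8):    # hypothetical district conservatism
        print(x, mirror(x), winner_take_all(x))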
CASE 10. How does support for redistribution affect actual
redistribution?
All developed democracies tend to redistribute some income from
wealthier to poorer people, through taxation. (Tax havens, indirect taxes
and access to policymaking often mean, though, that wealthier people
actually pay less in taxes than the poor.) Some people like it, some
don’t. One may well presume that the extent of redistribution is higher
when more people support it. The graph below (Lupu and Pontusson
2011) shows the extent of redistribution (call it R) plotted against the
support for redistribution (S).
What is the average overall pattern? Copy the graph and insert the
additions needed. Do not be confused by the box and line shown – they
have no meaning or have a questionable meaning. Think of allowed and
forbidden regions, logical anchor points, approaches to limits, and
equality lines. Do not expect that all of these enter in every instance.
This is a tricky case. It would seem that both S and R could in
principle range from 0 to 100%. This is indeed so for support. But
redistribution could come close to 100% only if one person owned all
the wealth and nearly all of it were redistributed to all the rest. We are
not yet there. The graph above suggests that R reaches a ceiling around
32% even when support for it approaches 100%, for the average
country. So assume an anchor point at S=100%, R=32%.
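A minimal sketch, under the added assumption (not stated by the graph) that S=0 would go with R=0: a fixed exponent curve through that point and the anchor S=100%, R=32% stays inside the allowed region for any exponent k > 0. The value of k used below is a placeholder; the data would have to decide.

    # Minimal sketch: fixed exponent curve through an assumed anchor
    # S=0, R=0 and the anchor S=100, R=32 suggested in the text.
    def R(S, k=1.0):                # k=1 gives a straight line; k is a guess
        return 32 * (S / 100) ** k

    for S in (25, 50, 75, 100):     # support for redistribution, in percent
        print(S, round(R(S, k=1.5), 1))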
CASE 11. How do more parties reduce the largest seat share in
government?
As the number of parties increases, those extra parties may well whittle
down the shares of the largest parties – shares of their seats in the
assembly and also of their ministerial seats, if they form the government. So we might expect a decreasing pattern. The graph below
(Electoral Studies 37: 7, 2015) confirms it.
What is the average overall pattern? Copy the graph and insert the additions needed. Do not be confused by the lines shown – they have no
meaning or have a questionable meaning. Think of allowed and
forbidden regions, logical anchor points, approaches to limits, and
equality lines. Do not expect that all of these enter in every instance.
Call the effective number of parties N. Call the seat share of the largest
party in the government L.
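A hedged way to probe the shape of the decrease, in the spirit of the log-log graphing used earlier in the book: if L fell off as a power of N, the points would line up straight on fully logarithmic scales, with the slope giving the exponent. The readings below are invented, only to show the mechanics.

    # Minimal sketch, with invented readings: test whether L behaves like
    # a power of N by looking at the logarithms.
    import math

    pairs = [(2.0, 0.55), (3.0, 0.42), (5.0, 0.30)]    # hypothetical (N, L)
    for N, L in pairs:
        print(round(math.log10(N), 3), round(math.log10(L), 3))
    # A straight-line pattern on log-log paper would suggest L = N**(-k),
    # with k given by the slope; curvature would point to some other format.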
References
Anscombe, Francis J. (1973) Graphs in statistical analysis. The American
Statistician 27: 17–21.
Comte, Auguste (18xx) Plan of Scientific Studies Necessary for Reorganization
of Society.
Dalton, Russell J. (1988) Citizen Politics in Western Democracies. Chatham,
NJ: Chatham House.
Dalton, Russell J. (2006) Citizen Politics. Washington, DC: CQ Press.
Dalton, Russell J. and Shin, Doh Chull (2006) Citizens, Democracy, and Markets
Around the Pacific Rim: Congruence Theory and Political Culture. Oxford:
Oxford University Press.
Hawking, Stephen (2010) The Grand Design.
Heath, Oliver (2005) Party systems, political cleavages and electoral volatility
in India: A state-wise analysis, 1998-1999. Electoral Studies 24: 177-99.
Johnston, Richard, Hagen, Michael G. and Jamieson, Kathleen H. (2004) The
2000 Presidential Election and the Foundations of Party Politics.
Cambridge: Cambridge University Press.
Kvålseth, Tarald O. (1985) Cautionary note about R². The American Statistician 39: 279–85.
Lijphart, Arend (1984) Democracies: Patterns of Majoritarianism and Consensus Government. New Haven, CT: Yale University Press.
Lijphart, Arend (1994) Electoral Systems and Party Systems. Oxford: Oxford
University Press.
Lijphart, Arend (1999) Patterns of Democracy: Government Forms and
Performance in Thirty-Six Countries. New Haven, CT: Yale University
Press.
McCloskey, ….. (2009) The Cult of Statistical Significance.
<http://www.amazon.com/Cult-Statistical-Significance-EconomicsCognition/dp/0472050079/ref=sr_1_1?ie=UTF8&s=books&qid=1265515998&sr=8-1>
Mickiewicz, Ellen (1973) Handbook of Soviet Social Science Data. New York:
Free Press.
Norris, Pippa, and Inglehart, Ronald (2003) Islamic Culture and Democracy:
Testing the “Clash of Civilizations” Thesis, pp. 5-33 in Ronald Inglehart,
editor, Human Values and Social Change. Leiden & Boston: Brill.
Rousseau, Jean-Jacques (1762) Le contrat social. English translation by M.
Cranston: The Social Contract. London: Penguin Books, 1968.
Stein, James D. (2008) How Math Explains the World. New York, NY:
HarperCollins/Smithsonian Books.
Taagepera, Rein (1976) Why the trade/GNP ratio decreases with country size.
Social Science Research 5: 385–404.
Taagepera, Rein (1979) People, skills and resources: An interaction model for
world population growth. Technological Forecasting and Social Change
13: 13–30.
Taagepera, Rein (1997) Expansion and contraction patterns of large polities:
Context for Russia. International Studies Quarterly 41: 475–504.
Taagepera, Rein (1999) The Finno-Ugric Republics and the Russian State.
London: Hurst.
Taagepera, Rein (2007) Predicting Party Sizes: The Logic of Simple Electoral
Systems. Oxford: Oxford University Press.
Taagepera, Rein (2008) Making Social Sciences More Scientific: The Need for
Predictive Models. Oxford: Oxford University Press.
Taagepera, Rein (2010) Adding Meaning to Regression. European Political
Science 10: 73–85.
Taagepera, Rein (2014) A world population growth model: Interaction with
Earth’s carrying capacity and technology in limited space. Technological
Forecasting and Social Change 82: 34–41.
Taagepera, Rein and Sikk, Allan (2007) Institutional determinants of mean
cabinet duration. Early manuscript for Taagepera and Sikk (2010).
Taagepera, Rein and Sikk, Allan (2010) Parsimonious model for predicting
mean cabinet duration on the basis of electoral system. Party Politics 16:
261–81.
Taagepera, Rein and Hayes, James P. (1977) How trade/GNP ratio decreases
with country size. Social Science Research 6: 108–32.
Thorlakson, Lori (2007) An institutional explanation of party system congruence: Evidence from six federations. European Journal of Political
Research 46: 69–95.