

5 Capacity of Machine Learning Models
In the previous chapter, we investigated how to estimate the information content of the training data in bits. Since the purpose of a model is to generalize the experimental results into a rule that can predict future experiments, most people would intuitively agree that a model should not be more complex than the training data. We introduced this intuition in Chapter 2, especially Section 2.2.1. More formally, the information content of the training data should give an upper limit for the complexity of the model. But how does one measure the complexity of a model? For that, we need the notion of model capacity, which is the topic of this chapter.
Intuitively, the word capacity describes how much something can contain. For example, the capacity of an elevator describes how many people can ride in it. In general, the physical capacity of a container describes how much content can be stored
in it. The capacity of a container is often considered equivalent to its volume (the actual capacity, however, must always be a bit smaller than the volume of the container
as we have to account for technicalities like the walls of the container). In computer
science, memory capacity describes how much information can be stored in some
information storage, for example RAM or a hard disk. As discussed in the previous
chapter, volumes of data are measured in bits and, for practical reasons, in bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes etc., whereby 1 byte equals 8 bits.
As explained in Section 4.4, the complexity of our training data is measured as the
mutual information between the input (experimental factors) and the output (experimental results). This is because the training data describes a function from the input
to the output. Therefore, the capacity of a model describes the maximum complexity
of a function that can be stored in a model. Formally, it is defined as follows.
definition 5.1 (Information Capacity) C = sup I(X; Y)
where C is the capacity measured in bits and I(X; Y) is the mutual information as
defined in Definition 4.13.
The supremum (abbreviated sup) of a subset S of a partially ordered set P is the least element in P that is greater than or equal to each element of S, if such an element exists. Consequently, the supremum is also referred to as the least upper bound. Informally, the capacity is therefore the least upper bound of the mutual information over all possible functions that could be implemented from X to Y.
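For intuition, the following minimal Python sketch (the function name mutual_information and the use of empirical probabilities are our own illustrative choices) estimates $I(X; Y)$ in bits from observed input-output pairs:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X; Y) in bits from a list of (x, y) samples,
    using empirical probabilities (cf. Definition 4.13)."""
    n = len(pairs)
    p_xy = Counter(pairs)                # joint counts
    p_x = Counter(x for x, _ in pairs)   # marginal counts for X
    p_y = Counter(y for _, y in pairs)   # marginal counts for Y
    return sum((c / n) * math.log2((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
               for (x, y), c in p_xy.items())

# A noiseless binary function from X to Y carries exactly 1 bit:
print(mutual_information([(0, 0), (1, 1)] * 50))  # -> 1.0
```

A model whose capacity $C$ is at least this estimate can, in principle, store the function the data describes.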
This notion can be counter-intuitive at first and, in fact, it has been a topic of
discussion and challenges for decades. For this reason, this chapter contains the optional Section 5.3. It provides a historical perspective on complexity measurements of functions that does not contribute to the main thread of this book but is intended to promote a big-picture understanding of complexity.
5.1 Intellectual Capacity
Just as we did in Chapter 2, let us first understand how capacity is modelled in humans before we apply the metrics to machine-generated models. A prominent early study of what humans are able to learn was conducted by Binet and Simon in 1904 (Binet & Simon 1904). They looked at what children of different social and ethnic status can learn and ultimately came up with the now well-known measure of the Intelligence Quotient (IQ). Their study was so impactful that, as of the writing of this book, the standard IQ test is still called the Stanford-Binet test. Their definition of intelligence is short and therefore, in the author's opinion, too general, but it serves as a very valuable starting point. Binet and Simon defined intelligence as
"the ability to adapt".
The problem with this definition is that there are many things that seem to be able to
adapt without being intelligent. For example, when one puts a foot into sand, the sand
adapts around it to form a footprint. Nobody would think sand is intelligent. Instead,
the ability of the sand to adapt around the foot is explained by Isaac Newton's "actio est reactio" (action is followed by reaction) principle (Newton 1687). Since this principle is universal, the entire universe would be intelligent as per Binet and Simon's definition. The definition worked for them, as they limited the scope of this definition to children learning in school.
For our purposes, we should be able to distinguish smart phones from regular phones or a primate's brain from a spider's ganglion. This requires us to be more specific about the adaptation part. What makes a smart phone more intelligent than a regular phone is that the regular phone is built for one exact task, while a smart phone allows the installation of applications that change its purpose. The spider's ganglion has evolved for the spider to find a suitable place for spinning a web, spin the web, catch food, kill and eat the food, and find a mate and reproduce. A primate's brain, while essentially also performing the same tasks of nesting, metabolizing, and reproducing, allows the primate to use tools and perform tasks in a variety of ways different from those dictated by evolution. This book will therefore use the following, more specific, definition of intelligence.
definition 5.2 (Intelligence) The ability to adapt to changing tasks.
Of course, adapting to new tasks is what students do in school, and it is what earns a personal assistant device the label "artificial intelligence". However, the above definition does not help us with any quantification of this ability. For that, again, we need
capacity.
definition 5.3 (Intellectual Capacity) The number of tasks an intelligent system is
able to adapt to.
So intellectual capacity is a volume of tasks. As one can see, this is hard to measure
for systems as complex as biological intelligence. A typical IQ test will approximate
the number of tasks a person can adapt to by showing the subject a number of different problems to solve in a well-defined amount of time. The IQ is then a relative
measure that compares the number of tasks correctly solved by an individual against
the number of tasks typically solved by a population. An IQ of 100 means the person
has solved the same number of tasks as an average person would. A higher IQ number means more tasks, a lower number means fewer tasks, scaled along the standard
deviation of the population’s IQ.
For artificially created systems, we can be much more rigorous. Since Chapter 2,
we know that we adapt a finite state machine to the function implied by our training
data. Therefore, we can define machine-learning model capacity as follows:
definition 5.4 (Model Capacity) The number of unique target functions a machine
learning system is able to adapt to.
Even though this definition seems straightforward, measuring the actual capacity
of a machine learner is not. Model capacity is usually a function of the number of
model parameters (MacKay 2003), the representation function used (see Section ??),
and the algorithm chosen to train the model. Furthermore, machine learning models
are able to adapt to a target function in two ways: through training and through
generalization.
In fact, memory is already a mechanism that is trainable without generalization: $n$ bits of memory can adapt to $2^n$ different states. Defining each state as a task makes memory quite intelligent: its intellectual capacity grows exponentially with the storage capacity. However, as we intuitively know, storage of knowledge is just one part of being intelligent. The other part is being able to react to unknown situations, that is, to generalize the knowledge. In this chapter we focus on memorization and in Chapter 6 we will focus on the second part of intelligence: generalization.
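To make the memorization half concrete, here is a minimal Python sketch of a learner that is pure storage (the class name Memorizer and its interface are hypothetical illustrations):

```python
class Memorizer:
    """A lookup table viewed as a learner: it adapts perfectly to any
    target function on the inputs it has stored, but it cannot
    generalize to inputs it has never seen."""

    def __init__(self):
        self.table = {}

    def train(self, x, y):
        self.table[x] = y        # n stored bits can reach 2**n states

    def predict(self, x):
        return self.table[x]     # raises KeyError for unseen inputs
```

Such a learner represents any of the $2^n$ labelings of the $n$ inputs it has stored, but every unseen input exposes the missing half of intelligence.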
5.1.1 Minsky's Criticism
In 1969, Marvin Minsky and Seymour Papert (Minsky & Papert 1969) argued that there were a number of fundamental problems with researching neural networks. They argued that there were certain tasks, such as the calculation of topological functions of connectedness or the calculation of parity, which Perceptrons could not solve.¹ Of particular significance was the inability of a single neuron to learn to evaluate the logical function of exclusive-or (XOR). The results of Minsky and Papert's analysis led them to the conclusion that, despite the fact that neurons were "interesting" to study, ultimately neurons and their possible extensions were what they called a "sterile" direction of research.

¹ Perceptron being the original name for a single neuron.

What Minsky and Papert seemed to have ignored is that their problem had already been solved by Thomas Cover in 1964. His PhD thesis discussed the statistical
and geometrical properties of "linear threshold devices"; a summary was published as (Cover 1965). Thomas Cover was among the first people to actually follow the concluding comment of Rosenblatt's original Perceptron paper (Rosenblatt 1958): "By the study of systems such as the perceptron, it is hoped that those fundamental laws of organization which are common to all information handling systems, machines and men included, may eventually be understood". That is, Thomas Cover worked on understanding the Perceptron's information properties.
5.1.2 Cover's Solution
Thomas Cover found that a single neuron with linear activation, that is, a system that thresholds the experimental inputs $x_i$ using a weighted sum $\sum_{i=1}^{n} w_i x_i \geq 0$ (with $w_i$ being the weights), has an information capacity of 2 bits per weight $w_i$. That is, the absolute maximum mutual information it can model is $I(X; Y) = 2\,\frac{\text{bits}}{\text{weight}}$. Cover's article (Cover 1965) also investigated other representation functions, but this is beyond the scope of this chapter. The main insight was that the information capacity of a single neuron could be determined by a formula that was already derived in 1852 by Ludwig Schläfli (Schläfli 1852). Cover called it the Function Counting Theorem. It counts the maximum number of linearly separable sets $C$ of $n$ points in arbitrary position in a $d$-dimensional space.
theorem 5.5 (Function Counting Theorem)
$$C(d, n + 1) = C(d, n) + C(d - 1, n) \qquad (5.1)$$
with boundary conditions C(d > 0, 1) = 2 (a single point can be classified either way)
and C(0, n) = 0, where n is the number of points and d is the number of dimensions.
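The recurrence is easy to evaluate directly. Below is a minimal memoized Python sketch (the closed form $C(d, n) = 2 \sum_{k=0}^{d-1} \binom{n-1}{k}$, which follows from the recurrence, gives the same values and can serve as a cross-check):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def C(d, n):
    """Maximum number of linearly separable labelings of n points in
    arbitrary position in d dimensions (Theorem 5.5)."""
    if d == 0:
        return 0    # boundary condition: C(0, n) = 0
    if n == 1:
        return 2    # a single point can be classified either way
    return C(d, n - 1) + C(d - 1, n - 1)   # Eq. (5.1), shifted by one in n

print(C(3, 4))  # -> 14 (see the discussion of Minsky's criticism below)
```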
Note that this equation immediately transforms into the information capacity of a neuron when $d$ is set to the number of parameters and $n$ is the number of rows of a training table in a binary classification problem. In other words, a single neuron has the intellectual capacity to train any binary classifier where the number of rows in the training table is smaller than or equal to the number of experimental factor columns. We are now already in a position to respond to Minsky's criticism: while there are 16 unique tables of the form $(x_1, x_2, f(\vec{x}))$, representing all 2-variable boolean functions, a single neuron can only represent $C(3, 4) = 14$ functions of 4 points in 3 dimensions. The two that are missing are XOR and equality.
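The count $C(3, 4) = 14$ can also be verified by brute force. The sketch below enumerates all 16 two-variable boolean functions and tests each against a single neuron of the form $w_1 x_1 + w_2 x_2 + b > 0$; searching over a small weight grid is our own simplification and suffices only for this tiny input space:

```python
from itertools import product

def separable(f):
    """True if some (w1, w2, b) realizes f via w1*x1 + w2*x2 + b > 0."""
    grid = [i * 0.5 for i in range(-4, 5)]   # candidate weights in [-2, 2]
    return any(all((w1 * x1 + w2 * x2 + b > 0) == bool(y)
                   for (x1, x2), y in f.items())
               for w1, w2, b in product(grid, repeat=3))

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
realizable = [labels for labels in product([0, 1], repeat=4)
              if separable(dict(zip(points, labels)))]
print(len(realizable))  # -> 14; missing: (0,1,1,0) = XOR, (1,0,0,1) = equality
```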
In order to see how $C(d, n)$ behaves in general (for non-binary cases), how it can be applied to other machine learners (including networks of neurons), and how to find out a priori that boolean XOR and equality are not trainable, we need to dig into more detail.
5.1.3 MacKay's Insight
TODO: Summarize MacKay, Chapter 40.
5.2 Memory Equivalent Capacity of a Model
Cover's Function Counting Theorem, together with the idea of intellectual capacity and MacKay's understanding of a machine learner as a parameter memorizer, immediately leads to one of the most important definitions in this book.
definition 5.6 (Memory Equivalent Capacity) A model's intellectual capacity is memory equivalent to $N$ bits when the machine learner is able to represent all $2^N$ binary labeling functions of $N$ uniformly random inputs.
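Definition 5.6 also suggests a direct, if expensive, empirical test. The following sketch (the function memory_equivalent_to and the choice of scikit-learn's Perceptron as the model are our own assumptions) tries to memorize every one of the $2^N$ labelings of $N$ random inputs; because the optimizer rather than the capacity may fail, the test only ever establishes a lower bound:

```python
import numpy as np
from itertools import product
from sklearn.linear_model import Perceptron

def memory_equivalent_to(n_bits, make_model, dim=3, seed=0):
    """Check Definition 5.6 empirically: can the model represent all
    2**n_bits binary labelings of n_bits uniformly random inputs?
    A training failure may be the optimizer's fault, not the model's,
    so a False result does not prove a lack of capacity."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_bits, dim))      # N uniformly random inputs
    for labels in product([0, 1], repeat=n_bits):
        if len(set(labels)) < 2:
            continue                         # fit() rejects one-class labelings
        y = np.array(labels)
        model = make_model().fit(X, y)
        if model.score(X, y) < 1.0:          # failed to memorize this labeling
            return False
    return True

# A single neuron (3 weights + bias) should memorize 3 random points:
print(memory_equivalent_to(3, lambda: Perceptron(max_iter=1000)))
```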
5.3 History of Complexity Measurements

5.3.1 Kolmogorov Complexity

5.3.2 Shannon Number

5.3.3 P vs NP

5.3.4 VC Dimension

5.3.5 Physical Work

5.3.6 Example: Why do Diamonds exist?

5.4 Exercises

5.5 Further Reading