5 Capacity of Machine Learning Models

In the previous chapter we investigated how to estimate the information content of the training data in bits. Since the purpose of a model is to generalize the experimental results into a rule that can predict future experiments, most people would intuitively agree that a model should not be more complex than the training data. We introduced this intuition in Chapter 2, especially Section 2.2.1. More formally, the information content of the training data should give an upper limit for the complexity of the model. But how does one measure the complexity of a model? To do so, we need the notion of model capacity, which is the topic of this chapter.

Intuitively, the word capacity describes how much something can contain. For example, the capacity of an elevator describes how many people can ride in it. In general, the physical capacity of a container describes how much content can be stored in it. The capacity of a container is often considered equivalent to its volume (the actual capacity, however, must always be a bit smaller than the volume, as we have to account for technicalities like the walls of the container). In computer science, memory capacity describes how much information can be stored in some information storage, for example RAM or a hard disk. As discussed in the previous chapter, volumes of data are measured in bits and, for practical reasons, in bytes, kilobytes, megabytes, gigabytes, terabytes, petabytes, etc., whereby 1 byte equals 8 bits.

As explained in Section 4.4, the complexity of our training data is measured as the mutual information between the input (experimental factors) and the output (experimental results). This is because the training data describes a function from the input to the output. The capacity of a model therefore describes the maximum complexity of a function that can be stored in that model. Formally, it is defined as follows.

definition 5.1 (Information Capacity)

C = sup I(X; Y),

where C is the capacity measured in bits and I(X; Y) is the mutual information as defined in Definition 4.13.

The supremum (abbreviated sup) of a subset S of a partially ordered set P is the least element in P that is greater than or equal to each element of S, if such an element exists. Consequently, the supremum is also referred to as the least upper bound. Informally, the capacity is therefore the least upper bound of the mutual information over all possible functions that could be implemented from X to Y. This notion can be counter-intuitive at first and, in fact, it has been a topic of discussion and challenges for decades. For this reason, this chapter contains the optional Section 5.3. It provides a historic perspective on complexity measurements of functions that does not contribute to the main thread of this book but is intended to promote big-picture understanding of complexity.
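To make Definition 5.1 concrete, the following minimal Python sketch estimates the mutual information of a finite training table directly from its empirical distributions, as in Definition 4.13. The function name and the toy table are illustrative assumptions, not notation from this book.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    p_xy = Counter(pairs)                 # joint counts of (x, y)
    p_x = Counter(x for x, _ in pairs)    # marginal counts of x
    p_y = Counter(y for _, y in pairs)    # marginal counts of y
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi

# A noiseless training table: four distinct inputs, binary output.
table = [(0, 0), (1, 1), (2, 1), (3, 0)]
print(mutual_information(table))  # 1.0 bit
```

Because the toy table is a noiseless function with each input occurring once, I(X; Y) collapses to the entropy of the outputs, here exactly 1 bit.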
5.1 Intellectual Capacity

Just as we did in Chapter 2, let us understand how capacity is modelled in humans first before we apply the metric to machine-generated models. A prominent early study of what humans are able to learn was conducted by Binet and Simon in 1904 (Binet & Simon 1904). They looked at what children of different social and ethnic status can learn and ultimately came up with the now well-known measure of the Intelligence Quotient (IQ). Their study was so impactful that, as of the writing of this book, the standard IQ test is still called the Stanford-Binet test.

Their definition of intelligence is short and therefore, in the author’s opinion, too general, but it serves as a very valuable starting point. Binet and Simon defined intelligence as “the ability to adapt”. The problem with this definition is that there are many things that seem to be able to adapt without being intelligent. For example, when one puts a foot into sand, the sand adapts around it to form a footprint. Nobody would think sand is intelligent. Instead, the ability of the sand to adapt around the foot is explained by Isaac Newton’s “actio est reactio” (action is followed by reaction) principle. Since this principle is universal, the entire universe would be intelligent as per Binet and Simon’s definition. The definition worked for them, as they limited its scope to children learning in school. For our purposes, we should be able to distinguish smart phones from regular phones, or a primate’s brain from a spider’s ganglion. This requires us to be more specific about the adaptation part. What makes a smart phone more intelligent than a regular phone is that the regular phone is built for one exact task, while a smart phone allows one to install applications that change its purpose. The spider’s ganglion has evolved for the spider to find a suitable place for spinning a web, spinning the web, catching food, killing and eating the food, and finding a mate and reproducing. A primate’s brain, while essentially also performing the same tasks of nesting, metabolizing, and reproducing, allows the primate to use tools and perform tasks in a variety of ways different from those dictated by evolution. This book will therefore use the following, more specific, definition of intelligence.

definition 5.2 (Intelligence) The ability to adapt to changing tasks.

Of course, adapting to new tasks is what students do in school or what makes a personal assistant device be called “artificial intelligence”. However, the above definition does not help us with any quantification of the ability. For this, again, we need capacity.

definition 5.3 (Intellectual Capacity) The number of tasks an intelligent system is able to adapt to.

So intellectual capacity is a volume of tasks. As one can see, this is hard to measure for systems as complex as biological intelligence. A typical IQ test will approximate the number of tasks a person can adapt to by showing the subject a number of different problems to solve in a well-defined amount of time. The IQ is then a relative measure that compares the number of tasks correctly solved by an individual against the number of tasks typically solved by a population. An IQ of 100 means the person has solved the same number of tasks as the average person would. A higher IQ number means more tasks, a lower number means fewer tasks, scaled along the standard deviation of the population’s IQ.

For artificially created systems, we can be much more rigorous. We know from Chapter 2 that we adapt a finite state machine to the function implied by our training data. Therefore, we can define machine learning model capacity as follows:

definition 5.4 (Model Capacity) The number of unique target functions a machine learning system is able to adapt to.

Even though this definition seems straightforward, measuring the actual capacity of a machine learner is not.
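For a model class small enough to enumerate, however, Definition 5.4 can be applied by brute force. The following sketch counts how many of the 2^n binary labelings of n points a one-dimensional threshold classifier sign(s · (x − t)) can represent. The model class, function name, and test points are illustrative assumptions, not part of the text.

```python
from itertools import product

def threshold_capacity(points):
    """Count how many of the 2^n binary labelings of the given 1-D points
    a threshold model f(x) = sign(s * (x - t)) can represent."""
    pts = sorted(points)
    # Candidate thresholds: below, between, and above the points; any other
    # threshold produces a labeling identical to one of these.
    cands = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    achievable = set()
    for s, t in product((+1, -1), cands):
        achievable.add(tuple(1 if s * (x - t) > 0 else 0 for x in points))
    return len(achievable), 2 ** len(points)

print(threshold_capacity([0.0, 1.0, 2.0]))  # (6, 8): 6 of the 8 labelings
```

For 3 points, the sketch reports 6 of the 8 possible labelings: the threshold model cannot adapt to the two “middle point differs” target functions (0, 1, 0) and (1, 0, 1). The next sections derive such counts analytically.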
Model capacity is usually a function of the number of model parameters (MacKay 2003), the representation function used (see Section ??), and the algorithm chosen to train the model. Furthermore, machine learning models are able to adapt to a target function in two ways: through training and through generalization. In fact, memory is already a mechanism that is trainable without generalization. $n$ bits of memory can adapt to $2^n$ different states. Defining each state as a task makes memory quite intelligent: its intellectual capacity grows exponentially with the storage capacity. However, as we intuitively know, storage of knowledge is just one part of being intelligent. The other part is being able to react to unknown situations, that is, to generalize the knowledge. In this chapter we focus on memorization, and in Chapter 6 we will focus on the second part of intelligence: generalization.

5.1.1 Minsky’s Criticism

In 1969, Marvin Minsky and Seymour Papert (Minsky & Papert 1969) argued that there were a number of fundamental problems with researching neural networks. They argued that there were certain tasks, such as the calculation of topological functions of connectedness or the calculation of parity, which Perceptrons (Perceptron being the original name for a single neuron) could not solve. Of particular significance was the inability of a single neuron to learn to evaluate the logical function of exclusive-or (XOR). The results of Minsky and Papert’s analysis led them to the conclusion that, despite the fact that neurons were “interesting” to study, ultimately neurons and their possible extensions were what they called a “sterile” direction of research.

What Minsky and Papert seem to have ignored is that their problem had already been solved by Thomas Cover in 1964. His PhD thesis discussed the statistical and geometrical properties of “linear threshold devices”; a summary was published as (Cover 1965). Thomas Cover was among the first people to actually follow the concluding comment of Rosenblatt’s original Perceptron paper (Rosenblatt 1958): “By the study of systems such as the perceptron, it is hoped that those fundamental laws of organization which are common to all information handling systems, machines and men included, may eventually be understood”. That is, Thomas Cover worked on understanding the Perceptron’s information properties.

5.1.2 Cover’s Solution

Thomas Cover found that a single neuron with linear activation, that is, a system that thresholds the experimental inputs $x_i$ using a weighted sum $\sum_{i=1}^{n} w_i x_i \geq 0$ (with $w_i$ being the weights), has an information capacity of 2 bits per weight $w_i$. That is, the absolute maximum mutual information it can model is I(X; Y) = 2 bits per weight. Cover’s article (Cover 1965) also investigated other representation functions, but this is beyond the scope of this chapter. The main insight was that the information capacity of a single neuron could be determined by a formula that had already been derived in 1852 by Ludwig Schläfli (Schläfli 1852). Cover called it the Function Counting Theorem. It counts the maximum number C of linearly separable labelings of n points in general position in a d-dimensional space.

theorem 5.5 (Function Counting Theorem)

C(d, n + 1) = C(d, n) + C(d − 1, n)   (5.1)

with boundary conditions C(d > 0, 1) = 2 (a single point can be classified either way) and C(0, n) = 0, where n is the number of points and d is the number of dimensions.
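The recurrence in Theorem 5.5 is straightforward to evaluate. The sketch below is a direct, memoized transcription of Equation 5.1 and its boundary conditions; the function name is ours.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def C(d, n):
    """Theorem 5.5: the maximum number of linearly separable labelings of
    n points in general position in d dimensions, computed from the
    recurrence C(d, n + 1) = C(d, n) + C(d - 1, n)."""
    if d <= 0:
        return 0          # boundary condition C(0, n) = 0
    if n == 1:
        return 2          # boundary condition C(d > 0, 1) = 2
    return C(d, n - 1) + C(d - 1, n - 1)

# All 2^n labelings are separable while n <= d ...
assert C(3, 3) == 2 ** 3
# ... but for n = 4 points in d = 3 dimensions, only 14 of the 16
# two-variable boolean functions are reachable:
print(C(3, 4))  # 14
```

Note that the same recurrence also reproduces the brute-force count of the one-dimensional threshold sketch from Section 5.1, which has d = 2 parameters (weight and threshold): C(2, 3) = 6.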
Note that this equation immediately transforms into the information capacity of a neuron when d is set to the number of parameters and n is the number of rows of a training table in a binary classification problem. In other words, a single neuron has the intellectual capacity to train any binary classifier where the number of rows in the training table is smaller than or equal to the number of experimental factor columns. We are now already in a position to respond to Minsky’s criticism: while there are 16 unique tables of the form $(x_1, x_2, f(\vec{x}))$, representing all 2-variable boolean functions, a single neuron can only represent C(3, 4) = 14 functions of 4 points in 3 dimensions. The two that are missing are XOR and equality. In order to see how C(d, n) behaves in general (for non-binary cases), why it can be applied to other machine learners (including networks of neurons), and how to find out a priori that boolean XOR and equality are not trainable, we need to dig into more detail.

5.1.3 MacKay’s Insight

TODO: Summarize MacKay, Chapter 40.

5.2 Memory Equivalent Capacity of a Model

Cover’s Function Counting Theorem, together with the idea of intellectual capacity and MacKay’s understanding of a machine learner as a parameter memorizer, immediately leads to one of the most important definitions in this book.

definition 5.6 (Memory Equivalent Capacity) A model’s intellectual capacity is memory equivalent to N bits when the machine learner is able to represent all $2^N$ binary labeling functions of N uniformly random inputs.

5.3 History of Complexity Measurements

5.3.1 Kolmogorov Complexity

5.3.2 Shannon Number

5.3.3 P vs NP

5.3.4 VC Dimension

5.3.5 Physical Work

5.3.6 Example: Why do Diamonds exist?

5.4 Exercises

5.5 Further Reading