Answering Watson’s Question: Progressive Abbreviation, and Number Theory Using Spreadsheets

L. Van Warren MS CS, AE
May 5, 2006

Abstract

A common problem in number theory is addressed using two methods of brute force search. The first search method is implemented as a ‘C’ program. The second search is performed in a spreadsheet. A generalized form of counting is developed to enable generation of the range indices for the second search method. This generalized form of counting takes the form of a flexible base machine. The flexible base machine is then used to generate the indices for a Cartesian product search. In the process, two observations are made, and an answer to a question posed by James Watson is suggested. The number problem is similar to bin packing problems. Progressive abbreviation, chunking, and grouping are related constructs in this context.

Preamble

I want to bring a couple of ideas to the front before we embark on our number problem journey. Try not to let their simplicity distract you. Suppose the following pattern is observed:

{ 101 blah blah blah 101 blah blah blah … }

Let the shorter token A abbreviate 101 in the pattern above. Substitution produces:

{ A blah blah blah A blah blah blah … }

Then let the shorter token B abbreviate blah blah blah in the above pattern to produce:

{ABAB…}

Continuing this abbreviation progressively, with C abbreviating AB, yields:

{CCC…}

and we have progressively abbreviated to saturation.

Changing gears a little, consider a number in base 10, say one hundred and one, 101. This number is the dot product of two vectors, a basis vector and a coefficient vector. Reading right to left, the basis vector is:

{ …, 100, 10, 1}

which can also be rewritten:

{ …, $10^2$, $10^1$, $10^0$}

and the coefficient vector is:

{ …, 1, 0, 1}

The dot product is:

$1 \times 10^2 + 0 \times 10^1 + 1 \times 10^0 = 101_{10}$

where the subscript indicates the base.
We can run the same process again in base two, changing the basis vector to base 2:

{ …, 4, 2, 1}

which can also be rewritten:

{ …, $2^2$, $2^1$, $2^0$}

and the coefficient vector is:

{ …, 1, 0, 1}

The dot product is:

$1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 101_2 = 5_{10}$

where the subscript again indicates the base. Now consider the process in base Q:

{ …, Q·Q, Q, 1}

which can also be rewritten:

{ …, $Q^2$, $Q^1$, $Q^0$}

and the coefficient vector is:

{ …, 1, 0, 1}

The dot product is:

$1 \times Q^2 + 0 \times Q^1 + 1 \times Q^0 = Q^2 + 1$

where the expression holds for any base Q. Consider what happens when Q is -2 or the imaginary number i. But first, the problem at hand.

Introduction

Daughter and I are at a high school math fair. A problem is written on a file card. The problem is:

Choosing from {16, 17, 23, 24, 29}
Show how you can make a sum of exactly 100.

There is a naïve part and a subtle part to this development.

First, The Naïve Part:

There are several interesting things about this problem:

a) To proceed, the solver must infer that multiple instances of a given integer are required. What enables this to be obvious to one and opaque to another? Instinct? Training? Pattern Recognition?
b) How much time is required to find all solutions by hand?
c) How many trial solutions exist?
d) How many actual solutions exist?
e) Do rearrangements exist that make the problem easier to solve?

Recall the incident when Gauss was asked by his teacher to add the first 100 integers. He did so by pairing 1 with 100, 2 with 99, etc., recognizing fifty such pairs. This pairing transformed a long addition into a short multiplication: 50 x 101 = 5050. The insight was not merely numerical. It was pattern recognition and viewpoint.

Development

Consider solutions of the form:

1) a x + b y + c z + d w + e v = 100

Reversing the list, the specific problem would look something like:

2) a·29 + b·24 + c·23 + d·17 + e·16 = 100

Now to enumerate trial solutions in an orderly way, the coefficients a thru e must be bounded from above and below.
To first order, trial values of a would vary from 0 to k such that

3) a·29 <= 100
4) k = floor(100/29) = 3

where we agree that floor rounds down to the nearest integer. We might limit a further by considering the likely participation of other integers. That is not done here. So we replace the coefficients a thru e in equation 2 by the intervals computed using the method of equations 3 and 4. Representing each interval in square braces as [0-k], with k computed separately for each of a thru e, we obtain:

5) [0-3]·29 + [0-4]·24 + [0-4]·23 + [0-5]·17 + [0-6]·16 = 100

The number of trial solutions is the size of the five-dimensional Cartesian product. The first estimate of this number was:

6) 3 × 4 × 4 × 5 × 6 = 1440

but this answer is an underestimate. To check the interval [0-k] requires k+1 operations, as any ‘C’ programmer will quickly tell you. Correcting this adds one to each upper bound and makes a nearly threefold difference in the product. The non-defective answer is:

7) 4 × 5 × 5 × 6 × 7 = 4200

and we have answered question c. There are 4200 trial solutions.

The amount of time required to generate all trial solutions is estimated as follows. Assume these calculations are performed on an infix calculator at the rate of one keystroke per second. Trial solutions of equation 2 require four adds and five multiplies each. Each add requires six keystrokes and each multiply five, because all adds involve pairs of two-digit numbers and all multiplies involve one-digit numbers times two-digit numbers. The ‘+’, ‘x’ and ‘=’ operators require one keystroke each. Adding it all up:

8a) time per add: 6 seconds
8b) time per multiply: 5 seconds
9) time per trial solution: 4 × 6 + 5 × 5 = 49 seconds of keystrokes, taken as 55 seconds in the estimates that follow
10) time for trial solutions: 55 × 4200 = 231,000 sec (64 hours 10 minutes)

Sixty-four hours requires the patience of David Blaine, who at this moment has been suspended in a spherical tank of water for nearly a week. One impulse was to throw this problem into a spreadsheet to see how it looked.
David Blaine looks unexpectedly round due to the spherical nature of the tank. Would using a spreadsheet lead to unexpected observations?

Would a compact ‘C’ program suffice to find the solutions? The program took 5 minutes to write, 5 minutes to debug and 0.03 seconds to run. That is 385 times faster than doing it by hand for the first run and 7.7 million times faster for subsequent runs. A computer makes one at least 385 times more productive than a magician floating in a tank of water, clever though he is. For completeness, the ‘C’ code for finding the solutions follows:

    #include <stdio.h>

    int main(void)
    {
        int iteration = 0;                  /* counts every trial examined */
        int i, j, k, l, m;
        int a[5] = { 3,  4,  4,  5,  6};    /* loop limits: index n runs from 0 to a[n] - 1 */
        int x[5] = {29, 24, 23, 17, 16};    /* the integers to choose from */

        for(i = 0; i < a[0]; i++)
         for(j = 0; j < a[1]; j++)
          for(k = 0; k < a[2]; k++)
           for(l = 0; l < a[3]; l++)
            for(m = 0; m < a[4]; m++)
            {
                if( (i*x[0] + j*x[1] + k*x[2] + l*x[3] + m*x[4]) == 100 )
                    printf("%4d: %2d %2d + %2d %2d + %2d %2d + %2d %2d + %2d %2d = 100\n",
                           iteration, i, x[0], j, x[1], k, x[2], l, x[3], m, x[4]);
                iteration++;
            }
        return 0;
    }

This answers question d. The run yielded three solutions, where iterations are numbered from zero:

    Iteration  26:  0 29 + 0 24 + 0 23 + 4 17 + 2 16 = 100
    Iteration 513:  1 29 + 0 24 + 1 23 + 0 17 + 3 16 = 100
    Iteration 750:  1 29 + 2 24 + 1 23 + 0 17 + 0 16 = 100

Placing this in tabular form:

    Solution   Obtained on Iteration   Value              Equiv. Hand Cranking
    0           26                     {0, 0, 0, 4, 2}    24 min: 45 s
    1          513                     {1, 0, 1, 0, 3}    7 hr: 50 min: 15 s
    2          750                     {1, 2, 1, 0, 0}    11 hr: 27 min: 30 s

Reversing the coefficients finds the first solution more quickly, a useful trick to remember.

    Solution   Obtained on Iteration   Value              Equiv. Hand Cranking
    0           20                     {0, 0, 1, 2, 1}    18 min: 20 s
    1          673                     {2, 4, 0, 0, 0}    10 hr: 16 min: 55 s
    2          734                     {3, 0, 1, 0, 1}    11 hr: 12 min: 50 s

These solutions occupy a five-dimensional discrete counting space. We have answered question b. This leaves questions a and e.
Question a dealt with how different people solve the same problem, and question e dealt with rearrangements of the problem that speed solution. Note that even though one finds all solutions within eleven hours and change, one must still evaluate all the trial solutions to prove that no others exist.

There is a certain irony to note. I spent about as much time with this problem as manual search would require. However, subsequent problems of this type are solved faster because the pattern is understood. This last remark addresses question a. There are two people in this solution, the one before the pattern is recognized and the one after.

Rearrangements

Refreshing our mental image of the problem, we look for rearrangements.

Choosing from {29, 24, 23, 17, 16}
Show how you can make a sum of exactly 100.

Question a implies a follow-up. What mental refresh rate is required by mathematicians vs. nonmathematicians? Are numbers stored visually, aurally, linguistically, some other way, or all of the above? When James Watson, the co-discoverer of DNA’s double helix, was asked by newsman Charlie Rose what question he would like to know the answer to, Watson replied, “Where [in the brain] is the number four stored?” That will be addressed in a moment. In the meantime, question e asks if there are rearrangements that make the math fair problem more tractable. We have:

11) a·29 + b·24 + c·23 + d·17 + e·16 = 100

In this instance of the problem, three of the integers are odd and two are even, a condition that would vary in any generalization. Exploiting this idea nonetheless, we expand to:

12) a (20 + 9) + b (20 + 4) + c (20 + 3) + d (10 + 7) + e (10 + 6) = 100

There is no factoring (obvious to me) that simplifies this equation, so its solution retains its brute force search nature. We live in an age where brute force search is enabled by computers. This has been exploited in symbolic manipulation. It is exploited here also.
Subtle Part:

Finding the solution using a ‘C’ program is simpler and faster than using a spreadsheet. Looping through the Cartesian product makes setting the bounds on the search intervals easy. However, it was in building the machinery to handle the spreadsheet version that two interesting observations were made. Is this because the intermediate steps were explicitly visualized?

As developed in the preamble, it is common practice to represent numbers using placeholder notation. Placeholders are just linear combinations of successive powers of a given base. So ordinary counting bears a striking resemblance to the math fair problem:

13) In base 10, the number 101 means $1 \times 10^2 + 0 \times 10^1 + 1 \times 10^0 = 101$.
14) In base 2, the number 101 means $1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 5$.

The same argument works for any base. The original problem was solved by searching linear combinations of integers that satisfied the given equation. Behold, counting numbers can be represented as linear combinations of a more general set of numbers than successive powers.

Basis and Coefficient Vector Make a Number System

In both the original problem and the representation of any counting number, there are two unstated vectors at play. One is the basis vector {100, 10, 1} that specifies the decades of the number. The other is the coefficient vector {1, 0, 1} that defines the values present in each decade.

We could write the number 101 as 1:0:1, but conventional placeholder notation ensures that a single digit owns each decade that contributes to the number. To release the restriction we use colons, a:b:c:d, to represent each coefficient of a multi-digit number in a flexible base format. This discussion is more easily demonstrated with a spreadsheet, and to that we must briefly turn. With the spreadsheet it is possible to construct working solutions to number theory problems and find the general rules that embody them. Poor man’s induction, if you will, that will reach its conclusion in the appendix.
Consider the following figure.

Figure 1 Spreadsheet

The spreadsheet must index through decades of trial values in an orderly fashion, starting with the least significant, rightmost trial values and incrementing the values to the left each time we complete a trip through a decade and roll over. This will produce an unexpected reward. Next, a three-decade counter is constructed, which works in any base. For example, base two:

Figure 2 Spreadsheet

A base 10 version requires only a simple change of basis from {4, 2, 1} to {100, 10, 1}.

Figure 3 Spreadsheet

The spreadsheet has been split to account for its size. The split is shown by the horizontal bar. We are almost ready to make our observation. Here are the recurrence formulas the sheet uses:

Figure 4 Spreadsheet Formula Detail

The second argument of the FLOOR function is its significance, the multiple to which the value is rounded down; a significance of 1 rounds down to an integer.

The Observation

Returning to our Figure 3 spreadsheet, we are not obligated to provide a basis vector whose numbers are successive powers. We could choose numbers from our original problem:

Figure 5 Spreadsheet

Linear combinations make the representation of any integer possible with just one restriction. Using the flexible base, we can write 57 using the basis vector {23, 16, 1}. That number is just 2:0:11, which expands to 2 x 23 + 0 x 16 + 11 x 1 = 46 + 11 = 57. The parity sum at the top of the sheet confirms that the target number and the sum agree for every trial solution examined.

It is strange and beautiful to expand a base represented by a scalar to a base represented by a vector of numbers that have no relationship except being different in size. In all counting there is the process of progressive abbreviation. We are always using a base that is represented by a vector, whether the base is {100, 10, 1} or the base is {23, 16, 1}. Progressive abbreviation is a powerful and reversible concept. It is linguistic, mathematical, and instinctive.
Is progressive abbreviation a shorthand mechanism by which concepts are linked in the brain? To answer Watson’s question above: The number four is stored as a progressive abbreviation, chosen by each person, that links to an expanding neural network of related notions. If that minimal progressive abbreviation is removed, the links to the concept are broken and an aphasia of four is the result.

An interesting restriction must be in place for flexible base to work. There must be a 1 in the low-order place, or it becomes impossible to represent all numbers. Said another way, absent 1, closure in the counting system is lost. Trading for that loss, we immediately obtain the power to approximate periodic functions. This has interesting properties and applications as suggested below. Absent 1 in the basis set, the sum - target sequence is a generator that undulates with unique patterns that depend on the basis vector. These patterns can be used to approximate discrete functions whose variation undulates in a similar way. Thus we have the first observation:

Closure of a counting system can be traded for approximation of periodic functions.

Figure 6 Spreadsheet

Absent 1 in the basis vector there is not full coverage of the linear combination and the representation cannot represent all integers. Here is one interesting pattern of non-closure:

Figure 6a Spreadsheet Graph – Absent One Generates Discrete Function

Similar arguments can be made for real numbers.

Figure 7 Spreadsheet

In base 10 and base 2 number systems, we have a convention for the basis vector: the base itself, which gives the rule for creating the basis vector. So “base 10” is the progressive abbreviation for a basis vector of {…, 100, 10, 1}, and these in turn progressively abbreviate common counting numbers. We are used to this. Counting numbers are just the coefficient vector written without delimiters and with the assumption of a basis vector.
A second observation is that:

A basis and coefficient vector pair implements a number system more general than a conventional number system.

Thus counting is a form of progressive abbreviation. Those possessing the instinctive ability to abbreviate, that is, to compress and expand linguistic and symbolic relationships, gave us the ability to count and all kinds of mathematical opportunity. This further amplifies the answer to question a above. It may be that concepts of number all originated in the act of abbreviation, historically speaking.

Now we will use the decade counters we generate to solve the original number theory problem in a non-looping construct. This non-looping construct is independent of base.

The original math fair problem is now cast in the non-looping spreadsheet construct. This brings together the naive and subtle developments using the flexible base machine. To accomplish this, a basis vector must be constructed that is six elements long. The recurrence relationships are similar to those above, but grow more complicated with each additional element, as shown in the Appendix. Analysis of the structure of the recurrence relationships yields the pattern necessary for basis vectors of any length. The recurrence relationships are constructed in a long form and then abbreviated by substitution. This process is shown for the five-element case in Figures 8 and 9. Figure 8 is a transitional construct.

Figure 8 Spreadsheet

Figure 9 Spreadsheet Excerpt – Increasing Complexity

A six-element version of the flexible base machine will be used to solve the original problem below. We needed a flexible base decade counter and one has been constructed: a bizarre and useful device similar to mixed-base appliances such as analog clocks.
Figure 10 Spreadsheet Using Flexible Base to Set Bounds

We use a modified form of the decade machine to provide the multipliers that search the space, and the same solutions are obtained, but at different iterations. The iterations are 30, 1095, and 1512; the solutions are {1, 2, 1, 0, 0}, {0, 0, 0, 4, 2}, and {1, 0, 1, 0, 3}.

Understanding why the iteration numbers differ between the ‘C’ program and the spreadsheet will be left as an informative exercise for the reader. The answer is given after the Appendix.

Application - Data Compression and Encoding

The patterns that occur absent 1 in the flexible base are unique to the basis vector. When these patterns correspond to the value of a signal, such as intensity or volume, they can be used to encode the pattern of a picture or a sound. The basis vector needs to be transmitted only as often as a change of basis is necessary to represent the image optimally. This turns the encoding of an image into a search for the optimal basis and coefficient vector that represents the image. If done in short runs, this need not be overly expensive computationally. One can in fact define the image interpolant by the length of the coefficient and basis vectors. Note that they are always the same length.

One would expect the encoding step to be more expensive than the decoding step, provided the basis vectors were published as part of the image data. The basis vectors form the key necessary to unlock the image. This could have applications in selling digital image or sound data. The data is free. You pay for the basis vector key.

The technique could be applied recursively so that even the key itself could be represented as a basis vector + coefficient vector + data stream. This might be useful for compartmentalized security where portions of a document are at different levels of classification: one document, many keys. Keys would be distributed to users according to their privilege or classification category. Could multi-level security benefit from this?
Application - Encryption

One useful encryption technique is to publish the number (the coefficient vector without delimiters, except for colons as noted above) without publishing the basis vector. Breaking the code then becomes a task of searching for the basis vector that creates a recognizable message. Search time grows exponentially with basis vector size. One tip-off to the contents of the basis vector would be the presence of colons. Colons surrounding a number give an upper bound to the size of that entry in the basis vector.

Application – Progressive Abbreviation of DNA

DNA consists of coding and non-coding parts. These parts can be labeled by a simpler basis set associated with function, and the genome can be compacted by the progressive abbreviation technique illustrated above. Consider James Watson. He uncovered the double-helix structure of DNA and then posed a question that led us to the next step.

A spreadsheet that demonstrates all calculations is available from the author.

Appendix – Building the Recurrence Relations for the Six-Fold Case

This will be done without further explanation by induction from the fivefold case, which was built from the threefold case.

    G8=FLOOR($B8/G$6,1)                                                     // Slot 6
    I8=FLOOR(($B8-G8*G$6)/I$6,1)                                            // Slot 5
    K8=FLOOR((($B8-G8*G$6)-I8*I$6)/K$6,1)                                   // Slot 4
    M8=FLOOR(((($B8-G8*G$6)-I8*I$6)-K8*K$6)/M$6,1)                          // Slot 3
    O8=FLOOR((((($B8-G8*G$6)-I8*I$6)-K8*K$6)-M8*M$6)/O$6,1)                 // Slot 2
    Q8=FLOOR(((((($B8-G8*G$6)-I8*I$6)-K8*K$6)-M8*M$6)-O8*O$6)/Q$6,1)        // Slot 1

Answer to Question: Loop Order in the ‘C’ program searches right to left.
