index

advertisement
Answering Watson’s Question:
Progressive Abbreviation,
and Number Theory Using Spreadsheets
L. Van Warren MS CS, AE
May 5, 2006
Abstract
A common problem in number theory is addressed using two methods of brute force search. The
first search method is implemented as a ‘C’ program. The second search is performed in a
spreadsheet. A generalized form of counting is developed to enable generation of the range indices
for the second search method. This generalized form of counting takes the form of a flexible base
machine. The flexible base machine is then used to generate the indices for a Cartesian product
search. In the process, two observations are made, and an answer to a question posed by James
Watson is suggested. The number problem is similar to bin packing problems. Progressive
abbreviation, chunking, and grouping are related constructs in this context.
Preamble
I want to bring a couple of ideas to the front before we embark on our number problem journey.
Try not to let their simplicity distract you. Suppose the following pattern is observed:
{ 101 blah blah blah 101 blah blah blah … }
Let the shorter token A abbreviate 101 in the pattern above. Substitution produces:
{ A blah blah blah A blah blah blah … }
Then let the shorter token B abbreviate blah blah blah in the above pattern to produce:
{ABAB…}
Continuing this abbreviation progressively with C abbreviating A B yields:
{CCC…}
and we have progressively abbreviated to saturation. Changing gears a little consider a number in
base 10, say one hundred and one, 101. This number is the dot product of two vectors, a basis
vector and a coefficient vector. Reading right to left the basis vector is:
{ …, 100, 10, 1}
which can also be rewritten:
{ …, 102, 101, 100}
and the coefficient vector is:
{ …, 1, 0, 1}
the dot product is:
1 x 102 + 0 x 101 + 1 x 100 = 10110
where the subscript indicates the base.
1
We can run the same process again in base two, changing the basis vector to base 2:
{ …, 4, 2, 1}
which can also be rewritten:
{ …, 22, 21, 20}
and the coefficient vector is:
{ …, 1, 0, 1}
the dot product is:
1 x 22 + 0 x 21 + 1 x 20 = 1012 = 510
where the subscript again indicates the base.
Now consider the process in base Q:
{ …, QQ, Q, 1}
which can also be rewritten:
{ …, Q2, Q1, Q0}
and the coefficient vector is:
{ …, 1, 0, 1}
the dot product is:
1 x Q2 + 0 x Q1 + 1 x Q 0 = Q2 + 1
where the result is independent of base.
Consider what happens when Q is -2 or the imaginary number i. But first, the problem at hand.
Introduction
Daughter and I are at a high school math fair. A problem is written on a file card. The problem is:
Choosing from {16, 17, 23, 24, 29} Show how you can make a sum of exactly 100.
There is a naïve part and a subtle part to this development. First,
The Naïve Part:
There are several interesting things about this problem:
a) To proceed, the solver must infer that multiple instances of a given integer are required.
What enables this to be obvious to one and opaque to another? Instinct? Training? Pattern
Recognition?
b) How much time is required to find all solutions by hand?
c) How many trial solutions exist?
d) How many actual solutions exist?
e) Do rearrangements exist that make the problem easier to solve. Recall the incident when
Gauss was asked by his teacher to add the first 100 integers. He did so by pairing 1 with
100, 2 with 99, etc. recognizing fifty such pairs. This pairing transformed a long addition
into a short multiplication: 50 x 101 = 5050. The insight was not merely numerical. It was
pattern recognition and viewpoint.
2
Development
Consider solutions of the form:
1)
a x + b y + c z + d w + e v = 100
Reversing the list, the specific problem would look something like:
2)
a 29 + b 24 + c 23 + d 17 + e 16 = 100
Now to enumerate trial solutions in an orderly way, the coefficients a thru e must be bounded from
above and below. To first order, trial values of a would vary from 0 to k such that
3)
a 29 <= 100
4)
k = floor(100/a) = floor(100/29) = 3
where we agree that floor rounds down to the nearest integer. We might limit a further by
considering the likely participation of other integers. That is not done here.
So we replace the coefficients a thru e in equation 2 by the intervals computed using the method
of equations 3 and 4. Representing intervals in square braces [0-ka-e] we obtain:
5)
[0-3] 29 + [0-4] 24 + [0-4] 23 + [0-5] 17 + [0-6] 16 = 100
The number of trial solutions is size of the five-dimensional Cartesian product. The first estimate of
this number was:
6)
3  4  4  5  6 = 1440
but this answer is an underestimate. To check the interval from [0-k] requires k+1 operations as
any ‘C’ programmer will quickly tell you. Correcting this adds one to each upper bound to makes a
threefold difference in the product. The non-defective answer is:
7)
4  5  5  6  7 = 4200
and we have answered question c. There are 4200 trial solutions.
The amount of time required to generate all trial solutions is estimated as follows:
Assume these calculations are performed on an infix calculator at the rate of one keystroke per
second. Trial solutions of equation 2 require four adds and five multiplies each. Each add requires
six keystrokes and each multiply five, because all adds involve pairs of two-digit numbers and all
multiplies involve one-digit numbers times two-digit numbers. The ‘+’, ‘x’ and ‘=’ operators require
one keystroke each. Adding it all up:
8)
8)
9)
10)
time
time
time
time
per add: 6 seconds
per multiply: 5 seconds
per trial solution: 4 * 6 + 5 * 5 = 55 seconds
for trial solutions: 55 x 4200 = 231,000 sec (64 hours 10 minutes)
Sixty-four hours requires the patience of David Blaine who at this moment has been suspended in
a spherical tank of water for nearly a week. One impulse was to throw this problem into a
spreadsheet to see how it looked. David Blaine looks unexpectedly round due to the spherical
nature of the tank. Would using a spreadsheet lead to unexpected observations?
3
Would a compact ‘C’ program suffice to find the solutions? The program took 5 minutes to write, 5
minutes to debug and 0.03 seconds to run. That is 385 times faster than doing it by hand for the
first run and 7.7 million times faster for subsequent runs. A computer makes one at least 385
times more productive than a magician floating in a tank of water, clever though he is. For
completeness, the ‘C’ code for finding the solutions follows:
main()
{
int
int
int
int
iteration = 0;
i, j, k, l, m;
a[5] = { 3, 4, 4, 5, 6};
x[5] = {29, 24, 23, 17, 16};
for(i = 0; i < a[0]; i++)
for(j = 0; j < a[1]; j++)
for(k = 0; k < a[2]; k++)
for(l = 0; l < a[3]; l++)
for(m = 0; m < a[4]; m++)
{
if( (i*x[0] + j*x[1] + k*x[2] + l*x[3] + m*x[4]) == 100 )
printf("%4d: %2d %2d + %2d %2d + %2d %2d + %2d %2d + %2d %2d = 100\n",
iteration, i, x[0], j, x[1], k, x[2], l, x[3], m, x[4]);
iteration++;
}
}
This answers question d.
The run yielded three solutions, where iterations are numbered from zero:
Iteration
Iteration
Iteration
26:
513:
750:
0 29 +
1 29 +
1 29 +
0 24 +
0 24 +
2 24 +
0 23 +
1 23 +
1 23 +
4 17 +
0 17 +
0 17 +
2 16 = 100
3 16 = 100
0 16 = 100
Placing this in tabular form:
Solution
0
1
2
Obtained on Iteration
26
513
750
Value
{0, 0, 0, 4, 2}
{1, 0, 1, 0, 3}
{1, 2, 1, 0, 0}
Equiv. Hand Cranking
24 min: 45 s
7 hr: 50 min: 15 s
11 hr: 27 min: 30 s
Reversing the coefficients finds the first solution more quickly, a useful trick to remember.
Solution
0
1
2
Obtained on Iteration
20
673
734
Value
{0, 0, 1, 2, 1}
{2, 4, 0, 0, 0}
{3, 0, 1, 0, 1}
Equiv. Hand Cranking
18 min: 20 s
10 hr: 16 min: 55 s
11 hr: 12 min: 50 s
These solutions occupy a five-dimensional discrete counting space.
We have answered question b. This leaves questions a and e. Question a dealt with how different
people solve the same problem and question e dealt with rearrangements of the problem that
speed solution. Note that even though one finds all solutions within eleven hours and change, that
one must evaluate all the trial solutions to prove that no others exist.
There is a certain irony to note. I spent about as much time with this problem as manual search
would require. However, subsequent problems of this type are solved faster because the pattern is
4
understood. This last remark addresses question a. There are two people in this solution, the one
before the pattern is recognized and the one after.
Rearrangements
Refreshing our mental image of the problem, we look for rearrangements.
Choosing from {29, 24, 23, 17, 16} Show how you can make a sum of exactly 100.
Question a implies a follow-up. What mental refresh rate is required by mathematicians vs. nonmathematicians?. Are numbers are stored visually, aurally, linguistically, some other way, or all of
the above? When James Watson, the co-discoverer of DNA’s double helix was asked by newsman
Charlie Rose what question he would like to know the answer to Watson replied, “Where [in the
brain] is the number four stored?”. That will be addressed in a moment. In the meantime, question
e asks if there are rearrangements that make the math fair problem more tractable. We have:
11)
a 29 + b 24 + c 23 + d 17 + e 16 = 100
In this instance of the problem, two of the coefficients are odd and three are even, a condition that
would vary in any generalization. Exploiting this idea nonetheless, we expand to:
12)
a ( 20 + 9) + b (20 + 4) + c (20 + 3) + d (10 + 7) + e (10 + 6) = 100
There is no factoring (obvious to me) that simplifies this equation, so its solution retains its brute
force search nature. We live in an age where brute force search is enabled by computers. This has
been exploited in symbolic manipulation. It is exploited here also.
Subtle Part:
Finding the solution using a ‘C’ program is simpler and faster than using a spreadsheet. Looping
through the Cartesian product makes setting the bounds on the search intervals easy. However it
was in building the machinery to handle the spreadsheet version that two interesting observations
were made. Is this because the intermediate steps were explicitly visualized?
As developed in the preamble, it is common practice to represent numbers using placeholder
notation. Placeholders are just linear combinations of successive powers of a given base. So
ordinary counting bears a striking resemblance to the math fair problem:
13) In base 10, the number 101 means 1 x 102 + 0 x 101 + 1 x 100 = 101
14) In base 2, the number 101
means 1 x 2 2 + 0 x 21 + 1 x 20 = 5.
The same argument works for any base. The original problem was solved by searching linear
combinations of integers that satisfied the given equation. Behold, counting numbers can be
represented as a linear combinations of a more general set of numbers than successive powers.
Basis and Coefficient Vector Make a Number System
In both the original problem, and in the representation of any counting number, there are two
unstated vectors at play.
One is the basis vector {100, 10, 1} that specifies the decades of the number.
The other is the coefficient vector {1, 0, 1} that defines the values present in each decade.
5
We could write the number 101 as 1:0:1, but conventional placeholder notation insures that a
single digit owns each decade that contributes to the number. To release the restriction we use
colons. a:b:c:d to represent each coefficient of a multi-digit number in a flexible base format.
This discussion is more easily demonstrated with a spreadsheet. And to that we must briefly turn.
With the spreadsheet is possible to construct working solutions to number theory problems and
find the general rules that embody them. Poor man’s induction if you will, that will reach its
conclusion in the appendix. Consider the following figure.
Figure 1 Spreadsheet
The spreadsheet must index through decades of trial values in an orderly fashion, starting with the
least significant rightmost trial values and incrementing left values when we complete a trip
through a decade and roll over to the next value to the left. This will produce an unexpected
reward.
Next, a three decade counter is constructed, which works in any base. For example, base two:
Figure 2 Spreadsheet
A base 10 version requires only a simple change of basis from {4, 2, 1} to {100, 10, 1}.
6
Figure 3 spreadsheet
The spreadsheet has been split to account for its size. The split is shown by the horizontal bar. We
are almost ready to make our observation. Here are the recurrence formulas the sheet uses:
Figure 4 Spreadsheet Formula Detail
The second argument in the floor function is the number of significant digits to provide.
The Observation
Returning to our Figure 3 spreadsheet, we are not obligated to provide a basis vector whose
numbers are successive powers. We could choose numbers from our original problem:
Figure 5 Spreadsheet
7
Linear combinations make the representation of any integer possible with just one restriction.
Using the flexible base, we can write 57 using the basis vector {23, 16, 1} that number is just
2:0:11, which expands to 2 x 23 + 11 x 1 = 46 + 11 = 57. The parity sum at the top proves that
the target number and sum are always the same for all target trial solutions examined.
It is strange and beautiful to expand a base represented by a scalar, to a base represented by a
vector of numbers that have no relationship except being different in size. In all counting there is
the process of progressive abbreviation, we are always using a base that is represented by a
vector, whether the base is {100, 10, 1} or the base is {23, 16, 1}. Progressive abbreviation is a
powerful and reversible concept. It is linguistic, mathematical, and instinctive. Is Progressive
abbreviation a shorthand mechanism by which concepts are linked in the brain?
To answer Watson’s question above:
The number four is stored as a progressive abbreviation, chosen by each person that
links to an expanding neural network of related notions.
If that minimal progressive abbreviation is removed, the links to the concept are broken and an
aphasia of four is the result.
An interesting restriction must be in place for flexible base to work. There must be a 1 in the lowordered place, or it becomes impossible to represent all numbers. Said another way, absent 1,
closure in the counting system is lost. Trading for that loss, we immediately obtain the power to
approximate periodic functions. This has interesting properties and applications as suggested
below. Absent 1 in the basis set, the sum - target sequence is a generator than undulates with
unique patterns that depend on the basis vector. These patterns can be used to approximate
discrete functions whose variation undulates in a similar way. Thus we have the first observation:
Closure of a counting system can traded for approximation of periodic functions.
Figure 6 Spreadsheet
Absent 1 in the basis vector there is not full coverage of the linear combination and the
representation cannot represent all integers. Here is one interesting pattern of non-closure:
8
Figure 6a Spreadsheet Graph – Absent One Generates Discrete Function
Similar arguments can be made for real numbers.
Figure 7 Spreadsheet
In base 10 and base 2 number systems, we have a convention for the basis vector - the base itself
which gives the rule for creating the basis vector. So “base 10” is the progressive abbreviation for
a basis vector of {…, 100, 10, 1} and these in turn progressively abbreviation common counting
numbers. We are used to this. Counting numbers are just the coefficient vector written without
delimiters and with the assumption of a basis vector.
A second observation is that:
A basis and coefficient vector pair implements a number system more general
than a conventional number system.
Thus counting is a form of progressive abbreviation. Those possessing the instinctive ability to
abbreviate, that is to compress and expand linguistic and symbolic relationships gave us the ability
to count and all kinds of mathematical opportunity. This further amplifies the answer to question a
above. It may be that concepts of number all originated in the act of abbreviation, historically
speaking.
Now we will use the decade counters we generate to solve the original number theory problem in a
non-looping construct. This non-looping construct is independent of base.
9
The original math fair problem is now cast in the non-looping spreadsheet construct. This brings
together the naive and subtle developments using the flexible base machine. To accomplish a basis
vector must be constructed that is six elements long. The recurrence relationships are similar to
above, but complicate with each additional element as shown in Appendix. Analysis if the structure
of the recurrence relationships yields the pattern necessary for basis vectors of any length. The
recurrence relationships are constructed in a long form and then abbreviated by substitution. This
process is shown for the five-element case in Figures 8 and 9. Figure 8 is a transitional construct.
Figure 8 Spreadsheet
Figure 9 Spreadsheet Excerpt – Increasing Complexity
A six-element version of the flexible base machine will be used to solve the original problem below.
We needed a flexible base decade counter and one has been constructed - a bizarre and useful
device similar to mixed-base appliances such as analog clocks.
10
Figure 10 Spreadsheet Using Flexible Base to Set Bounds
We use a modified form of the decade machine to provide the multipliers that search the space and
the same solutions are obtained, but at different iterations:
Iteration
30:
Iteration 1095:
Iteration 1512:
{1, 2, 1, 0, 0}
{0, 0, 0, 4, 2}
{1, 0, 1, 0, 3}
Understanding why the iteration numbers differ between the ‘C’ program and the spreadsheet will
be left as an informative exercise for the reader. The answer is given after the Appendix.
Application - Data Compression and Encoding
The patterns that occur absent 1 in the flexible base create unique patterns. When these patterns
correspond to the value of a signal, such as intensity or volume, they can be used to encode the
pattern of a picture or a sound. The basis vector can be transmitted only as often as a change of
basis is necessary to represent the image optimally.
This turns the encoding of an image into a search for the optimal basis and coefficient vector that
represents the image. If done in short runs, this need not be overly expensive computationally.
One can in fact define the image interpolant by the length of the coefficient and basis vectors. Note
that they are always the same length.
One would expect the encoding step to be more expensive than the decoding step provided the
basis vectors were published as part of the image data. The basis vectors form the key necessary
to unlock the image. This could have applications in selling digital image or sound data. The data is
free. You pay for the basis vector key. The technique could be applied recursively so that even the
key itself could be represented as a basis vector + coefficient vector + data stream. This might be
useful for compartmentalized security where portions of a document are at different levels of
classification - one document, many keys. Keys would be distributed to users according to their
privilege or classification category. Could multi-level security benefit from this?
11
Application - Encryption
One useful encryption technique is to publish the number (coefficient vector without delimiters
except for colons as noted above) without publishing the basis vector.
Breaking the code then becomes a task in searching for the basis vector that creates a
recognizable message. Search time grows exponentially with basis vector size.
One tip off to the contents of the basis vector would be the presence of colons. Colons surrounding
a number give an upper bound to the size of that entry in the basis vector.
Application – Progressive Abbreviation of DNA
DNA consists of coding and non-coding parts. These parts can be labeled by a simpler basis set,
associated with function and the genome can be compacted by the progressive abbreviation
technique illustrated above. Consider James Watson. He uncovered the double-helix structure of
DNA and then posed a question that led us to the next step.
A spreadsheet that demonstrates all calculations is available from the author.
Appendix – Building the Recurrence Relations for the Six-Fold Case
This will be done without further explanation by induction
from the fivefold case which was built from the threefold case.
G8=FLOOR($B8/G$6,1)
// Slot 6
I8=FLOOR(($B8-G8*G$6)/I$6,1)
// Slot 5
K8=FLOOR((($B8-G8*G$6)-I8*I$6)/K$6,1)
// Slot 4
M8=FLOOR(((($B8-G8*G$6)-I8*I$6)-K8*K$6)/M$6,1)
// Slot 3
O8=FLOOR((((($B8-G8*G$6)-I8*I$6)-K8*K$6)-M8*M$6)/O$6,1) // Slot 2
Q8=FLOOR(((((($B8- G8*G$6)-I8*I$6)-K8*K$6)-M8*M$6)-O8*O$6)/Q$6,1) // 1
Answer to Question: Loop Order in the ‘C’ program searches right to left.
12
Download