From: AAAI-84 Proceedings. Copyright ©1984, AAAI (www.aaai.org). All rights reserved.
FIVE PARALLEL ALGORITHMS FOR PRODUCTION SYSTEM EXECUTION ON THE DADO MACHINE*

Salvatore J. Stolfo
Computer Science Department
Columbia University
New York City, N.Y. 10027
Abstract

In this paper we specify five abstract algorithms for the parallel execution of production systems on the DADO machine. Each algorithm is designed to capture the parallelism inherent in a variety of different production system programs. Ongoing research aims to substantiate our conclusions by empirically evaluating the performance of each algorithm on the DADO2 prototype, presently under construction at Columbia University.

(Goal Clear-top-of =x)
(Isa =x Block)
(On-top-of =y =x)
(Isa =y Block)
-->  delete(On-top-of =y =x)
     assert(On-top-of =y Table)

Figure 1: An Example Production. If the goal is to clear the top of a block (=x) which is covered by something (=y) which is also a block, then remove the fact that =y is on =x and assert that =y is on the table.
1 Introduction

In this paper we outline five abstract algorithms specifying parallel execution of production system (PS) programs on the DADO machine. Each algorithm offers a number of advantages for particular types of PS programs. We expect to implement these algorithms on the DADO2 prototype and critically evaluate the performance of each on a variety of application programs. Software development is presently underway using the DADO1 prototype that has been operational at Columbia University since April, 1983.
We begin with a brief description of PS's and identify various possible characteristics of PS programs which may not be immediately apparent from a general description of the basic formalism. These characteristics lead to different algorithms which will be discussed in the remaining sections of this paper.

2 Production Systems

In general, a Production System (PS) [Newell 1973, Davis and King 1975, Rychener 1976, Forgy 1982] is defined by a set of rules, or productions, which form the Production Memory (PM), together with a database of assertions called the Working Memory (WM). Each production consists of a conjunction of pattern elements, called the left-hand side (LHS) of the rule, along with a set of actions called the right-hand side (RHS). The RHS specifies information that is to be added to (asserted) or removed from WM when the LHS successfully matches against the contents of WM. In operation, the production system repeatedly executes the following cycle of operations:

1. Match: For each rule, determine whether the LHS matches the current environment of WM. Each pattern element is matched against each WM element; variables must be bound consistently throughout the LHS. All matching instances of the rules are collected in the conflict set of rules.

2. Select: Choose exactly one of the matching rules according to some predefined criterion.

3. Act: Add to or delete from WM all assertions specified in the RHS of the selected rule, or perform some operation.

Pattern elements in the LHS may have a variety of forms which are dependent on the form and content of WM elements. In the simplest case, patterns are lists composed of constants and variables (prefixed with an equals sign), while WM elements are simple lists of constant symbols (corresponding to tuples of the relational algebra). An example production, borrowed from the blocks world, is illustrated in figure 1.
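The recognize/act cycle and the simple list-based pattern language above can be sketched in a few lines. The sketch below is illustrative only (the function names and rule encoding are assumptions, not from the paper); it encodes the blocks-world production of figure 1 and runs one Match/Select/Act cycle, with Select trivially taking the first matching instance.

```python
# Minimal sketch of one recognize/act cycle: patterns are tuples whose
# variables are strings prefixed with "="; WM elements are constant tuples.

def match_pattern(pattern, wm_element, bindings):
    """Match one pattern element against one WM element, extending bindings."""
    if len(pattern) != len(wm_element):
        return None
    env = dict(bindings)
    for p, w in zip(pattern, wm_element):
        if p.startswith("="):              # a variable ...
            if env.get(p, w) != w:         # ... must be bound consistently
                return None
            env[p] = w
        elif p != w:                       # a constant must match exactly
            return None
    return env

def match_lhs(lhs, wm, bindings=None):
    """All consistent variable bindings of a conjunction of pattern elements."""
    if bindings is None:
        bindings = {}
    if not lhs:
        return [bindings]
    results = []
    for w in wm:
        env = match_pattern(lhs[0], w, bindings)
        if env is not None:
            results += match_lhs(lhs[1:], wm, env)
    return results

def instantiate(template, env):
    return tuple(env.get(t, t) for t in template)

# The production of figure 1, encoded directly:
rule = {
    "lhs": [("Goal", "Clear-top-of", "=x"), ("Isa", "=x", "Block"),
            ("On-top-of", "=y", "=x"), ("Isa", "=y", "Block")],
    "delete": [("On-top-of", "=y", "=x")],
    "assert": [("On-top-of", "=y", "Table")],
}

wm = {("Goal", "Clear-top-of", "A"), ("Isa", "A", "Block"),
      ("Isa", "B", "Block"), ("On-top-of", "B", "A")}

conflict_set = match_lhs(rule["lhs"], wm)      # Match
env = conflict_set[0]                          # Select (trivial)
wm -= {instantiate(t, env) for t in rule["delete"]}   # Act
wm |= {instantiate(t, env) for t in rule["assert"]}
```

After the cycle, block B has moved from A to the table in WM, which is exactly the state change the rule's RHS specifies.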
During the selection phase of production system execution, a typical interpreter provides conflict resolution strategies based on the recency of matched data in WM, as well as syntactic discrimination. Other resolution schemes are possible, but for the present paper such issues will not significantly change our analysis, and hence will not be discussed.
*This research has been supported by the Defense Advanced Research Projects Agency through contract N00039-84-C-0165, as well as grants from Intel, Digital Equipment, Hewlett-Packard, Valid Logic Systems, AT&T Bell Laboratories and IBM Corporations, and the New York State Science and Technology Foundation. We gratefully acknowledge their support.
We shall only consider the parallel execution of PS programs with the goal of accelerating the rule firing rate of the recognize/act cycle as well as the number of WM transactions performed. In a later section of this paper, we shall consider other possible parallel activities as, for example, the concurrent execution of multiple PS programs.
In its simplest form, parallel execution of a production system may be viewed as illustrated in figure 2:

1. Assign some subset of the rules to a set of (distinct) processors.

2. Assign a subset of WM elements (possibly distinct from those in other processors) to each rule processor.

3. Repeat until no rule is active:

   a. Broadcast an instruction to begin the match phase, each processor matching its rules against its local subset of WM and storing its matching instances in a local conflict set.

   b. Report the set of matching rules, from which a single rule instance is selected according to some criterion.

   c. Broadcast the instantiated RHS actions of the selected rule; each processor updates its local WM accordingly prior to the next cycle.

Figure 2: A Simple Parallel Production System Algorithm.

This very simple view of the parallel implementation forms the basis of our subsequent analysis.

3 Characteristics of Production System Programs

In this section we enumerate various characteristics of PS programs in general terms. The reader will note that these characteristics are less indicative of a specific PS formalism, but rather are characteristics of various problems whose solutions are encoded in rule form. It should be noted, though, that the "inherent parallelism" of problems may not be represented by the particular PS formalism used for their solution.

1. Temporal Redundancy. [Forgy 1982] On each cycle, relatively few changes are made to WM. Thus, matching need not be repeated on each cycle by a PS interpreter that saves the previous matching state and incorporates only the changes to WM.

2. Few Affected Rules. On each cycle, the number of rules affected by changes to WM may be relatively small.

3. Many Affected Rules. Conversely, many rules may be affected by changes to WM on each cycle.

4. Massive changes of WM (non-temporally redundant). Many changes may be made to WM on each cycle, in which case saving state between cycles is not appropriate.

5. Restricted scope of WM access. The pattern elements of a rule may match only a relatively small subset of WM.

6. Global tests of WM. Conversely, pattern elements may need access to all of WM (for example, requiring a threshold number of WM elements, or comparing some value against large portions of WM).

7. Multiple rule firings. More than one of the matching rule instances reported in the conflict set may be executed on each cycle.

8. Small PM. The number of rules may be restricted to a few hundred.

9. Small WM. WM may consist of only a few hundred elements.

10. Large PM. A PS may consist of several thousands of rules in PM.

11. Large WM. Similarly, WM may consist of thousands of data elements.

4 Five Abstract Algorithms

The reader is assumed to be knowledgeable about the Rete match algorithm (see [Forgy 1979] and [Forgy 1982]). We will thus freely discuss the details of the Rete match when needed without prior explication. We begin with a brief description of the DADO architecture. (The reader is encouraged to see [Stolfo 1983] and [Stolfo and Miranker 1984] for complete details of the system.)

In this section we outline five different algorithms suitable for direct execution on the DADO machine. Each will be independently discussed, leading to various conclusions about which characteristics they are most appropriate for capturing. Ongoing research aims to verify our conclusions by empirically evaluating their performance for different classes of PS programs.
4.1 The DADO Machine

DADO is a fine-grain, parallel machine where processing and memory are extensively intermingled. A full-scale production version of the DADO machine would comprise a very large (on the order of a hundred thousand) set of processing elements (PE's), each containing its own processor, a small amount (16K bytes, in the current prototype design) of local random access memory (RAM), and a specialized I/O switch. The PE's are interconnected to form a complete binary tree.
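The complete-binary-tree interconnection can be modeled with heap-style 1-based indexing, which is convenient for reasoning about which PE's fall inside a given subtree. This is only a toy addressing model of my own, not a claim about DADO's actual wiring; the 1023-PE figure matches the DADO2 prototype described below.

```python
# Toy model of a complete binary tree of n PE's using heap-style 1-based
# indices: PE i's children are 2i and 2i+1, its parent is i // 2.

def parent(i):
    return i // 2 if i > 1 else None

def children(i, n):
    return [c for c in (2 * i, 2 * i + 1) if c <= n]

def subtree(i, n):
    """All PE indices in the subtree rooted at PE i."""
    frontier, seen = [i], []
    while frontier:
        node = frontier.pop()
        seen.append(node)
        frontier.extend(children(node, n))
    return seen

N = 1023   # a 10-level complete binary tree, as in the DADO2 prototype
```

Any internal PE roots a complete binary subtree of its own, a property exploited later by Algorithm 5.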
Within the DADO machine, each PE is capable of executing in either of two modes under the control of run-time software. In the first, which we will call SIMD mode (for single instruction stream, multiple data stream), the PE executes instructions broadcast by some ancestor PE within the tree. (SIMD typically refers to a single stream of "machine-level" instructions. On DADO, on the other hand, SIMD is generalized to mean a single stream of remote procedure invocation instructions. Thus, DADO makes more effective use of its communication bus by broadcasting more "meaningful" instructions.) In the second, which will be referred to as MIMD mode (for multiple instruction stream, multiple data stream), each PE executes instructions stored in its own local RAM, independently of the other PE's. A single conventional coprocessor, adjacent to the root of the DADO tree, controls the operation of the entire ensemble of PE's.

When a DADO PE enters MIMD mode, its logical state is changed in such a way as to effectively "disconnect" it and its descendants from all higher-level PE's in the tree. In particular, a PE in MIMD mode does not receive any instructions that might be placed on the tree-structured communication bus by one of its ancestors. Such a PE may, however, broadcast instructions to be executed by its own descendants, providing all of these descendants have themselves been switched to SIMD mode. The DADO machine can thus be configured in such a way that an arbitrary internal node in the tree acts as the root of a tree-structured SIMD device in which all PE's execute a single instruction (on different data) at a given point in time. This flexible architectural design supports multiple-SIMD execution (MSIMD). Thus, the machine may be logically divided into distinct partitions, each executing a distinct task, and is the primary source of DADO's speed in executing a large number of primitive pattern matching operations concurrently.

Our comments will be directed towards the DADO2 prototype consisting of 1023 PE's constructed from commercially available chips. Each PE contains an 8 bit Intel 8751 processor, 16K bytes of local RAM, 4K bytes of local ROM and a semi-custom I/O switch. The DADO2 I/O switch, which is being implemented in semi-custom gate array technology, has been designed to support rapid communication. In addition, a specialized global combinational circuit incorporated within the I/O switch will allow for the very rapid selection of a single distinguished PE from a set of candidate PE's in the tree, a process we call max-resolving. (The max-RESOLVE instruction computes the maximum of a specified register in all PE's in one instruction cycle, which can then be used to select a distinct PE from the entire set of PE's taking part in the operation.) Currently, the 15 processing element version of DADO performs these operations in firmware embodied in its off-the-shelf components.

4.2 Algorithm 1: Full Distribution of PM

In this case, a very small number of distinct rules is distributed to each of the 1023 DADO2 PE's, as well as all WM elements relevant to the rules in question, i.e., only those data elements which match some pattern in the LHS of the rules. Algorithm 1 alternates the entire DADO tree between MIMD and SIMD modes of operation. The match phase is implemented as an MIMD process, whereas selection and act execute as SIMD operations. In simplest terms, each PE executes the match phase for its own small PS. One such PS is allowed to "fire" a rule, however, which is communicated to all other PE's. The algorithm is illustrated in figure 3.

1. Initialize: Distribute a simple rule matcher and a few distinct rules to each PE. Distribute the relevant WM elements to each PE. Set CHANGES to the initial WM elements.

2. Repeat the following:

3. Act: For each WM-change in CHANGES do:

   a. Broadcast a WM-change command (add or delete a WM element) to all PE's.

   b. [Each PE operates independently in MIMD mode and modifies its local WM. If this is an addition, it matches its local rules and modifies its local conflict set as appropriate. If this is a deletion, it checks its set of matching instances and removes conflict set instances accordingly.]

   c. end

4. Find local maxima: Broadcast an instruction to each PE to rate its local matching instances (conflict set) according to some predefined conflict resolution criteria (see [McDermott and Forgy, 1978]).

5. Select: Using the high-speed max-RESOLVE circuit of DADO2, identify a single PE from among all PE's with maximally rated instantiated rules.

6. Instantiate: Report the instantiated RHS actions of the winning rule instance to all PE's. Set CHANGES to the reported WM-changes.

7. end Repeat;

Figure 3: Full Distribution of Production Memory.

4.2.1 Discussion of Algorithm 1
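The effect of the max-RESOLVE selection step can be sketched as a one-line reduction over the ratings reported by the PE's. This is a toy software model of the circuit's effect only; the function name and the tie-breaking rule (lowest PE index wins) are my assumptions.

```python
# Toy model of max-RESOLVE: each PE reports the rating of its best local
# conflict-set instance; a single distinguished PE is selected.

def max_resolve(ratings):
    """ratings: {pe_index: rating}. Return one distinguished PE holding the
    maximum rating (ties broken here by lowest PE index, an assumption)."""
    best = max(ratings.values())
    return min(pe for pe, r in ratings.items() if r == best)

# Three PE's report local maxima; PE's 2 and 7 tie at rating 9.
winner = max_resolve({2: 9, 5: 4, 7: 9})
```

On DADO2 this selection is a single-instruction-cycle hardware operation rather than a software scan, which is why step 5 above is cheap.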
We have left the details of the local match routine unspecified at step 3.b. Thus, a simple precompiled Rete match network and interpreter may be distributed to each processor. However, it is not clear whether a simple naive matching algorithm may be more appropriate since only a very small number of rules is present in each PE. Memory considerations may decide this issue: the overhead associated with linking and manipulating intermediate partial matches in a Rete network may be more expensive than direct pattern matching against the local WM on each cycle.
Performance of this algorithm varies with the complexity of the local match. In the best case, the time to match the rule set is bounded by the time to match only a few rules. The worst case is dependent on the maximum number of WM elements accessed during the match of the rules. If a simple naive match is used at each PE, this may require a considerable amount of computation even though the size of the local WM's is limited. Simple hashing of WM may dramatically improve a local naive matching operation, however.
We conclude that this algorithm is probably best suited to implementing PS programs characterized by:

1. Temporal redundancy (each PE saving state between cycles).
3. Many rules are affected on each cycle.
5. Restricted scope of WM access.

Note, though, that a WM element common to pattern elements of rules resident in separate PE's must be replicated at each such PE. An individual WM element may thus be stored at three or four PE's on average, depending on the number of pattern elements common between rules, and the additional storage required for WM may be quite large, even though each PE stores a minimum number of unique WM elements. Since WM-changes may be matched against many rules at once at each of the roughly 1000 PE's, considerable concurrency may potentially be achieved on each cycle.

The most serious drawback of this algorithm is the case where a local WM is too large to be conveniently stored in a PE. Clearly, characteristic 5 is appropriate for this algorithm only in the presence of characteristic 9, small WM. Multiple rule firings (characteristic 7) may be possible; a discussion of this case is deferred to a later section.
4.3 Algorithm 2: Original DADO Algorithm

The original DADO algorithm detailed in [Stolfo 1983] makes direct use of the machine's ability to execute in both MIMD and SIMD modes of operation at the same point in time. The machine is logically divided into three conceptually distinct components: a PM-level, an upper tree and a number of WM-subtrees. The PM-level consists of MIMD-mode PE's executing the match phase at one appropriately chosen level of the tree. A number of distinct rules are stored in each PM-level PE. The WM-subtrees rooted by the PM-level PE's consist of a number of SIMD-mode PE's collectively operating as a hardware content-addressable memory. WM elements relevant to the rules stored at the PM-level root PE are fully distributed throughout the WM-subtree. The upper tree consists of SIMD-mode PE's lying above the PM-level, which implement synchronization and selection operations.

It is probably best to view WM as a distributed relational database. Each WM-subtree PE thus stores relational tuples, and the PM-level PE's match the LHS's of rules in a manner similar to processing relational queries. In terms of the Rete match, intracondition tests of pattern elements in the LHS of a rule are executed as relational selection, while intercondition tests correspond to equi-join operations. Each PM-level PE thus stores a set of relational tests compiled from the LHS of the rule set assigned to it. Concurrency is achieved between PM-level PE's as well as in accessing PE's of the WM-subtrees. The algorithm is illustrated in figure 4.

4.3.1 Discussion of Algorithm 2

This algorithm would be particularly suitable for PS programs characterized as:

4. Non-temporally redundant (massive changes to WM on each cycle).
6. Global tests of WM.
11. Large WM.

The choice of the PM-level determines the distribution of rules and WM. For example, a PM-level at the level of the DADO2 tree consisting of 32 PE's allows a 32-way partition of PM; since each PM-level PE may store a number of rules, systems consisting of a thousand rules may be accommodated. Similarly, each WM-subtree in this configuration allows storage of roughly 3000-4000 WM elements, accessed in parallel as a content-addressable device. Large WM may thus be handled efficiently, but where the rule set is small, much of the machine's capacity is underutilized.

While attempting to implement temporally redundant systems, Algorithm 2 may recompute much of its match results calculated on previous cycles. This indeed may not be the case if we modify Algorithm 2 to incorporate many of the capabilities of the Rete match. Simple changes may dramatically improve the situation. For example, rather than iterating over each pattern element in each rule, we may only execute the match for those rules affected by new WM changes. The selection of affected rules can be achieved quickly by using the WM-subtree as an associative memory: by distributing pattern elements as relational tuples in a manner similar to WM, relational associative probing (perhaps faster than hashing) can be used to determine the relevant rules. Consideration of these techniques led us to investigate Rete for direct implementation on DADO2. Algorithms 3 and 4 detail this approach.
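The relational reading of an LHS, intracondition tests as selections and intercondition tests as equi-joins, can be made concrete with plain list comprehensions. The sketch below is illustrative (variable and list names are mine); it joins the two patterns (On-top-of =y =x) and (Isa =y Block) over WM tuples.

```python
# Selection then equi-join over a small WM of relational tuples.

wm = [("On-top-of", "B", "A"), ("On-top-of", "C", "B"),
      ("Isa", "A", "Block"), ("Isa", "B", "Block"), ("Isa", "C", "Pyramid")]

# Intracondition tests (relational selection): constant positions must match.
on_top = [t for t in wm if t[0] == "On-top-of"]                 # binds (=y, =x)
blocks = [t for t in wm if t[0] == "Isa" and t[2] == "Block"]   # binds =y

# Intercondition test (equi-join on the shared variable =y): the first field
# of an On-top-of tuple must equal the middle field of a selected Isa tuple.
instances = [{"=y": a[1], "=x": a[2]}
             for a in on_top for b in blocks if a[1] == b[1]]
```

Only ("On-top-of", "B", "A") survives the join, since C is a pyramid, giving the single instance binding =y to B and =x to A.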
1. Initialize: Distribute the partitioned sets of rules to the PM-level PE's. Distribute the initial WM elements to the WM-subtree PE's. Set CHANGES to the initial WM elements.

2. Repeat the following:

3. Act: For each WM-change in CHANGES do;

   a. Broadcast the WM-change to the PM-level PE's and their WM-subtrees.

   b. The match phase is executed by each PM-level PE: each WM-subtree PE matching an affected pattern element (relational selection) reports its variable bindings to the PM-level PE; any subsequent intercondition tests (relational equi-join) are executed sequentially for the rules stored at the PM-level PE. If the WM-change is an addition, a local conflict set of rules, along with a priority rating, is formed and stored in a distributed manner within the WM-subtree. If it is a deletion, the matching instances identified are deleted.

   c. end

4. The PM-level PE's synchronize with the upper tree.

5. Select: The max-RESOLVE circuit is used to identify the maximally rated matching instance.

6. Report the instantiated RHS of the winning instance.

7. Set CHANGES to the reported action specifications.

8. end Repeat;

Figure 4: Original DADO Algorithm.
4.4 Algorithm 3: Miranker's TREAT Algorithm

Daniel Miranker has invented an algorithm which modifies Algorithm 2 to include several of the features of the Rete match for saving state. The TREe Associative Temporally redundant (TREAT) algorithm [Miranker 1984] makes use of the same logical division of the DADO tree as in Algorithm 2. However, the state of the previous match operation is saved in distributed data structures within the WM-subtrees.

TREAT views the pattern elements in the LHS of rules as relational algebra terms, as in Algorithm 2. Thus, the evaluation of such relational algebra tests is also executed within the WM-subtrees. State is saved in the form of distributed Rete alpha memories in the WM-subtrees, corresponding to partial selections of tuples matching the various pattern elements. Rule instances in the conflict set computed on previous cycles are also stored in a distributed manner within the WM-subtrees. These two additions substantially improve the performance of Algorithm 2. (We note that Anoop Gupta of Carnegie-Mellon University independently analyzed a similar algorithm in [Gupta 1983]. Compared to Algorithm 2, TREAT should perform substantially better for temporally redundant systems. We note that Gupta's analysis of Algorithm 2, however, depends on certain assumptions that derive misleading results.)

Another aspect of TREAT is the clever manner in which relevancy is computed. Pattern elements are first distributed to the WM-subtrees. When a new WM element is added to the system, a simple match at each WM-subtree PE determines the set of rules at the PM-level which are affected by the change. Those rules identified are subsequently matched by the PM-level PE, restricting the scope of the match to a smaller set of rules than would otherwise be possible with Algorithm 2. The algorithm is outlined in figure 5.

4.4.1 Discussion of Algorithm 3

The TREAT algorithm is a refinement of Algorithm 2 incorporating temporal redundancy. Hence, TREAT is best suited for PS programs characterized as:
1. Temporally redundant.
3. Many rules are affected on each cycle.
6. Global tests of WM are also efficiently handled.
8. Small PM.
11. Large WM.
We note, though, that minor changes allow TREAT to implement Algorithm 2 directly (by setting L to all of the rules at the PM-level in step 3.d.ii and ignoring step 3.d.i). Thus, TREAT may also efficiently execute:

4. Non-temporally redundant systems.

In step 3.d.iii, TREAT also implements a useful ordering strategy. When iterating over each of the rules in L affected by recent changes in WM, those pattern elements with the smallest alpha memories are processed first. This technique tends to process the join operations quickly by filtering out many potentially failing partial joins.
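TREAT's relevancy computation, determining the set L of affected rules from a single WM addition, amounts to probing the distributed pattern elements and reading an index from patterns to rules. The sketch below is a toy, centralized rendering of that idea (all names, and the dictionary representation of alpha memories, are my assumptions, not Miranker's implementation).

```python
# Toy TREAT-style relevancy: alpha memories hold the tuples selected so far
# per pattern element; a WM addition updates only the memories it matches
# and yields L, the set of affected rules.

def selects(pattern, element):
    """Relational selection test: constants equal, '='-prefixed are variables."""
    return len(pattern) == len(element) and all(
        p.startswith("=") or p == e for p, e in zip(pattern, element))

patterns = {                               # pattern id -> pattern element
    "p1": ("On-top-of", "=y", "=x"),
    "p2": ("Isa", "=y", "Block"),
}
pattern_rules = {"p1": {"clear-top"}, "p2": {"clear-top", "stack"}}
alpha = {pid: set() for pid in patterns}   # the distributed alpha memories

def add_wm(element):
    """Incorporate one WM addition; return L, the set of affected rules."""
    L = set()
    for pid, pat in patterns.items():
        if selects(pat, element):
            alpha[pid].add(element)
            L |= pattern_rules[pid]
    return L
```

Only the rules in L are then matched at the PM-level, which is exactly the scope restriction described above.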
As noted above, Gupta's analysis of a TREAT-like algorithm, as well as subsequent analysis performed by Miranker [1984], show TREAT to be highly efficient compared to Algorithm 2 when executing temporally redundant systems. (The implementation, study and detailed analysis of TREAT forms a major part of Daniel Miranker's Ph.D. thesis.)

1. Initialize: Distribute a compiled, simple matcher (described below) and the appropriate set of rules to each PM-level PE. Distribute the pattern elements appearing in the LHS of the rules to the WM-subtree PE's below the PM-level. Set CHANGES to the initial WM elements.

2. Repeat the following:

3. Act: For each WM-change in CHANGES do;

   a. Broadcast the WM-change to the WM-subtree PE's.

   b. If this change is a deletion, broadcast an instruction to delete any affected conflict set instances calculated on previous cycles.

   c. Broadcast an instruction to enter the match phase.

   d. At each PM-level PE do;

      i. Each WM-subtree PE matches the WM-change against its stored pattern elements (relational selection), updating its alpha memory accordingly. [If this is a deletion, matching stored elements are deleted.]

      ii. Report the affected rules, forming a set L.

      iii. Order the pattern elements of the rules in L appropriately.

      iv. For each rule in L, match the remaining patterns of the rule (relational equi-join); each instance found is stored in the WM-subtree along with a priority rating.

      v. end

   e. end

4. Select: Use the max-RESOLVE circuit to find the maximally rated matching instance.

5. Report the winning instance.

6. Set CHANGES to the instantiated RHS of the winning instance.

7. end Repeat;

Figure 5: The TREAT Algorithm.

4.5 Algorithm 4: Fine-grain Rete
A Rete network compiled from the LHS's of a rule set consists of a number of simple nodes encoding match operations. Tokens, representing WM modifications, flow through the network in one direction and are processed by each node lying on their traversed paths. Fortunately, the maximum fan-in of any node in a Rete network is two. Hence, a Rete network can be represented as a binary tree (with some minimal amount of node splitting).

This observation leads to Algorithm 4, whereby a logical Rete network is embedded on the physical DADO binary tree structure. In the simplest case, leaf nodes of the DADO tree store and execute the initial linear chains of one-input test nodes, whereas internal DADO PE's execute two-input node operations. The physical connections between processors correspond to the logical data flow links in the Rete network. The entire DADO machine operates in MIMD mode while executing this algorithm, behaving much like a pipelined data flow architecture.

1. Initialize: Map the compiled Rete network on the DADO tree. Each node is provided with the appropriate match code and network information, and is loaded with its compiled one-input or two-input tests (see [Forgy 1982] for details). Set CHANGES to the initial WM elements.

2. Repeat the following:

3. Match: For each WM-change in CHANGES do;

   a. Broadcast the WM-change (a token) to the DADO leaf PE's.

   b. The leaf PE's begin processing their one-input tests, passing their results to their immediate ancestors.

   c. The interior PE's execute their two-input tests on new tokens communicated by their descendants; the process is repeated until the conflict set instances computed are reported to the root of the tree.

4. Select: The winning instance is then chosen by the root processor. Set CHANGES to the instantiated RHS of the chosen instance.

5. end Repeat;

Figure 6: Fine-grain Rete Algorithm.

4.5.1 Discussion of Algorithm 4

Since this algorithm is a straightforward implementation of the Rete match [Forgy 1979] embedded in the DADO tree, it is most suitable for PS programs characterized as:

1. Temporally redundant.
2. Few rules are affected on each cycle.
10. Large PM.

We believe that several Rete networks can be overlayed on the tree and processed by DADO2 in turn, since the storage capacity of a DADO2 PE may allow storage for the Rete network nodes and intermediate partial results (stored at alpha and beta memories), depending on the size of WM. Although overlayed Rete networks would be processed sequentially on DADO2, significant performance improvements can be achieved by a natural pipelining effect. Immediately following a successful match and communication at a node, the next two-input test from the overlayed network is initiated. Thus, while the parent node is processing the first network node, its children are proceeding with their tests of the second overlayed network node.

A second source of pipelining can improve performance as well. In this case, the entire RHS action specification is broadcast at once to the DADO leaf PE's at step 3.a. Immediately following the conclusion of the first match operation and communication of the first WM token, the leaf PE's initiate processing of the second WM token. Hence, as a WM token flows up the DADO tree, subsequent WM tokens flow close behind at lower levels of the tree in pipeline fashion.
4.0
execution
incorporate
DAD0
processor).
root
begin
to
physical
changes
in
at the
an
the
immediately
until
reports
the
to
which
PE’s
appropriate
2. Broadcast
PS-level
PM-level
the
the
DAD0
System-level
their
passing
two-input
maintained
new
waiting
by
tokens
repeated
DAD0
execute
the
communicated
their
set
from
CHANGES
4. end
idle
to
do;
The
instance
processors
computed
Those
is then
control
Select:
nodes
ancestors
conflict
end
on
processing
process
C.
sequences
lay
to
PE’s
leaf
tests
immediate
begin
all
the
results
descendants.
to
token)
test
interior
match
one-input
Rete
instruction
(First,
Match.
their
(a
divide
Production
similar
following:
each
Logically
static
a
elements.
3. Act:
algorithmic
process
executed
in the
controlled
by some
case is represented
upper
part
of the tree. The simplest
by the procedure
illustrated
in figure 7, which is similar in
some respects
to Algorithm
2.
Algorithm
In our
characteristic
as
5: Multiple
discussion
so far,
7, multiple
rule
- multiple,
independently
Asynchronous
Execution
no mention
was made about
We may view this
firings.
executing
PS
programs,
or
- executing
PS program
multiple
conflict
set
rules
of the
same
concurrently.
In this regard
we offer not a single algorithm,
but rather
an observation
that may be put to practical
use in each of
the abovementioned
algorithms.
We note that any DADO PE may be viewed as a root of a DADO machine. Thus, any algorithm operating at the physical root of DADO may also be executed by some descendant node. Hence, any of the aforementioned algorithms can be executed at various sites in the machine concurrently! (This was noted in [Stolfo and Shaw 1982].)
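This observation can be made concrete with a small sketch (assuming, purely for illustration, a complete binary tree of PE's stored in an array in heap order): any set of nodes, none of which is an ancestor of another, roots a collection of disjoint sub-machines that could run different algorithms concurrently.

```python
# Nodes of a complete binary tree in heap order: node i has children
# 2*i + 1 and 2*i + 2.  Any set of nodes in which no node is an
# ancestor of another roots a collection of disjoint sub-machines.

def subtree(root, size):
    """All PE indices in the subtree rooted at `root`."""
    nodes, frontier = [], [root]
    while frontier:
        n = frontier.pop()
        if n < size:
            nodes.append(n)
            frontier += [2 * n + 1, 2 * n + 2]
    return set(nodes)

SIZE = 15  # a 15-PE machine, 4 levels
left, right = subtree(1, SIZE), subtree(2, SIZE)

# The two subtrees partition the machine below the physical root,
# so two PS programs could execute on them concurrently.
assert left.isdisjoint(right)
assert left | right == set(range(1, SIZE))
```

The same decomposition applies recursively: either subtree root may itself delegate to its own descendants.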
This coarse level of parallelism, however, will need to be coordinated.
In addition
to concurrent
execution
of multiple
PS
programs,
methods
may
be employed
to
concurrently
execute
portions
of a single PS program.
These methods
are intimately
tied to the way rules are partitioned
in the
tree.
Subsets
of rules
may be constructed
by a static
analysis
of PM separating
those rules which do not directly
interact
with each other.
In terms of the problem-solving paradigm, for example, it may be convenient to
think
of
independent
subproblems
and
the
methods
implementing
their solution
(see [Newell 1973]). Each such
method
may
be
viewed
as
a
high-level
subroutine
represented
as an independent
rule set rooted
by some
internal
node of DADO.
Algorithm
1, for example,
may
be applied
in parallel
for
each
rule
set
in question.
Asynchronous
execution
of these subroutines
proceeds
in a
straightforward manner.
The complexity
arises when one
subset
of rules
infers
data
required
by other
rule sets.
The coordination
of these communication
acts is the focus
of our
ongoing
research.
Space
does
not
permit
a
complete
specification
of this
approach,
and
thus
the
reader
is encouraged
to see [Ishida 1984] for details of our
initial thinking
in this direction.
5 Conclusion
We have outlined five abstract algorithms for the parallel execution of PS programs on the DADO machine and indicated what characteristics they are best suited for. We summarize our results in tabular form as follows:
Algorithm                     Characteristics
1. Fully Distributed PM       1, 3, 5, 7, 9, 11
2. Original DADO              1, 3, 4, 6, 7, 8, 11
3. Miranker's TREAT           3, 4, 6, 7, 8, 11
4. Fine-grain Rete            1, 2, 5, 7, 9, 10
5. Multiple Asynchronous PS   Applies to all cases.
In this paper, we have outlined our expectations concerning the suitability of each of the algorithms for a variety of possible PS programs. We expect our findings to substantiate our claims, and to demonstrate this with working examples in the near future.
Of the five reported algorithms, only the original DADO algorithm (number 2) has been carefully studied analytically. The performance statistics of the remaining four algorithms have yet to be analyzed in detail. However, much of the performance statistics cannot be analyzed without specific examples and detailed implementations. Working in close collaboration with researchers at AT&T Bell Laboratories, in the course of the next year of our research we intend to implement each of the stated algorithms on a working prototype of DADO.
References

Davis, R. and J. King. An Overview of Production Systems. Technical Report, Stanford University, 1975.

Forgy, C. L. On the Efficient Implementation of Production Systems. Ph.D. Thesis, Department of Computer Science, Carnegie-Mellon University, 1979.

Forgy, C. L. Rete: A Fast Algorithm for the Many Pattern/Many Object Pattern Match Problem. Artificial Intelligence 19, 17-37, 1982.

Gupta, A. Implementing OPS5 Production Systems on DADO. Technical Report, Department of Computer Science, Carnegie-Mellon University, 1984.

Ishida, T. and S. Stolfo. Simultaneous Firing of Production Rules on Tree-structured Machines. Technical Report, Department of Computer Science, Columbia University, 1984.

McDermott, J. and C. Forgy. Production System Conflict Resolution Strategies. In D. Waterman and F. Hayes-Roth (Eds.), Pattern-directed Inference Systems, Academic Press, 1978.

Miranker, D. P. Performance Estimates for the DADO Machine: A Comparison of TREAT and RETE. Technical Report, Department of Computer Science, Columbia University, 1984.

Newell, A. Production Systems: Models of Control Structures. In W. Chase (Ed.), Visual Information Processing, Academic Press, 1973.

Rychener, M. Production Systems as a Programming Language for Artificial Intelligence. Ph.D. Thesis, Department of Computer Science, Carnegie-Mellon University, 1976.

Stolfo, S. and D. Shaw. DADO: A Tree-structured Machine Architecture for Production Systems. Proceedings of the National Conference on Artificial Intelligence, August 1982.

Stolfo, S. The DADO Parallel Computer. Technical Report, Department of Computer Science, Columbia University, August 1983. (Submitted to AI Journal.)

Stolfo, S. The DADO Production System Machine: System-level Details. Technical Report, Department of Computer Science, Columbia University, 1984. (Submitted to IEEE Transactions on Computers.)