December, 1982 LIDS-P-1265

advertisement
December, 1982
LIDS-P-1265
OPTIMAL FILE ALLOCATION PROBLEMS FOR
DISTRIBUTED DATA BASES IN UNRELIABLE
COMPUTER NETWORKS*
by
Moses Ma
Michael Athans
Laboratory for Information and Decision
Systems
Massachusetts Institute of Technology
Cambridge, Massachusetts
02139
ABSTRACT
This paper deals with the problem of optimally
locating files, and their optimum number of
redundant copies in a vulnerable communication
network.
It is assumed that each node and
link of the communication network can fail
independently. The optimization problem maximizes the probability that a commander can
access the subset of files that he needs
while minimizing the network-wide costs related to storage, query and update communication
costs.
The problem reduces to a linear zeroone integer programming one; several theorems
that reduce its complexity of solution are
presented.
1.1
INTRODUCTION
This. paper focuses on the problem of optimal
redundant file allocation for a very vulnerable distributed data base system. This
file allocation is different than previous
file allocation problems because it considers
the following new items:
1.
2.
3.
Vulnerability of the nodes and links due
to enemy actions, e.g. jamming
The importance of the users
The importance of particular files to
particular users.
First we shall discuss the motivation for the
problem of optimal file allocation in a vulnerable environment. Second we shall explain
the problem and its constraints.
Third we
shall discuss the possible trade-off in costs.
Fourth a brief literature survey will be
presented. Fifth the actual formulation wi
be explained. Sixth the various theorems
that have been developed for this formulation
will be stated and explained in words; however, we do not include the theorem proofs.
Seventh the conclusions and suggestions for
further research will be presented.
1.2
MOTIVATION
through its organic sensors. This information
must somehow be stored and maintained to the
utmost correctness because the BG must coordinate its actions. The system may be considered as a distributed data base system.
The ships and planes can be considered as
nodes, and the communication channels between
the ships and planes can be considered as
links. The data can be considered as files
stored in the computers of the ships and
planes.
Considering the BG as a set of nodes,
links and data files, we have defined for
ourselves a distributed data base network.
Since in warfare, ships can be destroyed and
communication links jammed, our network is
vulnerable. Therefore we must consider how
to maintain a consistent and complete data
base.
If we also consider the individual warfare commanders and the data files they need,
the problem becomes more complex.
We can
also rank the importance of each commander
and the importance of each data file to each
commander and include this in our optimization problem.
The problem is therefore as follows-we
are given the following:
1. A set of M data files
2. A set of N nodes to store the data files
3. The probabilities of any node of link
being destroyed from which the probabilities of any particular commander at one
node can access any particular file at
another node.
4. A set of L commanders.
5. The costs of assigning a particular file
particular node.
6. to
The any
query
rates for any particular file
emanating from any particular node. The
emanating from any particular node. The
query rate is the rate at which files are
requested.
7. The update rates for any particular file
emanating from any particular node.
The
emanating from any particular node. The
update rate is the rate at which files
are updated (changed).
The motivation behind our problem is in the
3
C
(Command, Control, and Communications)
context. In this context we are considering
a Naval Battle Group (BG) composed of aircraft carriers, cruisers, destroyers, airWe desire the
craft, etc. Th tr
1. To locate single following:
or multiple copies of
craft, etc. The BG gathers information
*Th' 3research was supported by the Office of Naval Research under contract ONR/N00014-77-
C-u6 Z (NR 041-519).
characterized by a very great number of variables even for application of limited dimensions; its solution was extremely hard from
a conceptual viewpoint.
Casey [21,[7] considered the problem of
allocating single files separately, but the
number of copies of each file was not assuned
to be fixed. Communication costs and storage
costs of allocations were analyzed in order
1.4 TRADE-OFF OF COSTS
to determine the optimal set of nodes on
which the file was to be allocated. The difference between retrieval and update transThere is a trade-off in costs if we
actions was stressed; while retrieval transconsider the following costs:
actions are routed to only one copy of the
file, update transactions are routed to all
1. query communication costs
the copies, in order to preserve consistency
2. update communication costs
Under the assumpof redundant information.
3. storage costs
tion of taking equal cost rates for retrieval
4. the cost to the BG if a particular comand updates, theorems were given for limitirng
mander does not have access to the parthe number of replicated copies of the file
ticular file he desires.
on the basis of the update/query ratio;
To minimize any one of the four costs we can
do minimizethe
followingy oneofthefourcostsweobviously the convenience of taking replicated copies decreases while the update/query
ratio increases.
Although the file alloca1. To reduce the query communication costs,
tion problem was analyzed for each file
we can store more redundant copies of
each file so teat each query can find its
separately, Eswaran [3] proved that Casey's
each
file
so that each querycommunication
canost. This
formulation was NP complete and, therefore,
the M files at the N nodes such that the
files will be accessible to the commanders
who need the files.
2. To locate the files at nodes that will
provide the least amount of cost. The
cost can be related query communication
costs, update communication costs and file
storage costs.
is true
true since
since we
shall assume
assume that
that each
suggested to investigate heuristic approaches.
is
we shall
each
Morgan and Levin
[4], examined both the
and transactions within
allocation
query goes to the nearest node containing
allocation of files and transactions within
the file.
a generalized, ARPA-like network. They adop'
2. To reduce the update communication costs,
ted the user's viewpoint, assuming to be
we can store fewer redundant copies of
eachsofile
that each update will update
under the jurisdiction of a network manageeach file so that each update will update
ment providing services at the market price.
fewer files and incur less communication
Because of this characterization of applicacosts. This is true since we assume each
tion environment, storage capacity constraints
update goes to all the nodes containing
update
the
file.
goes to all
were not included; the provision of sufficient
the file.
storage was considered a task of the network
3. To reduce the storage costs, we can store
management. Therefore, by introducing some
fewer redundant copies of each file so
fewthe
cost will decrease. ofeachfilesoother
simplifying assumption, the authors
4The ost w d edemonstrated that the multiple file alloca4. To reduce the cost of non-accessibility
of files
particular
to particular com
tion problem could be decomposed into independent (single) file allocation problems;
manders, we can store more redundant
they also developed an heuristic solution
'anders,
e
we can s
copies of that particular file so that it
nique.
tec
has a higher probability of being able to
Finally two contributions to the file
be accessed by the particular commander.
allocation problem have been very recently
presented. Ramamoorthy and Wah [51 analyzed
The bottom line is: we cannot increase
a relational Distributed Data Base; they oband decrease the number of redundant copies
served that the general approach of query
of a particular file. We would like to find
processing optimization consists in the minthe optimal number of redundant copies for
all the files and where to store them.
imization of communication costs. These canmunication costs are mostly due to data moves
.which are necessary for providing the logical
2. LITERATURE SURVEY
correlation, expressed by the query, between
files stored on different nodes. A logical
The file allocation problem was first
operation which is particularly critical is
investigated by Chu [1]; a global optimizathe join operation between remote files; a
tion was considered, consisting in obtaining
join between two files can be performed only
the minimum overall operating costs subject
if the two files are co-located at the same
to two kinds of constraints; first, the exnode. Therefore the authors developed a
pected time to access each file had to be
model in which redundant files are introduced
less than a given delay, and secondly the
in order to avoid distributed joins, on the
amount of storage needed at each computer
basis of the frequency of queries.
had not to exceed the available storage
capacity. The number of copies of each file
3. NOTATION
was assumed to be fixed. A generalized model was defined, in which storage and transth
mission costs were associated to file allocab. = the available memory size of the i
computer
tions; channel queues were modeled in order
to introduce the constraint on the delay.
x.. = the file j is stored at node i
The resulting linearized integer program was
=
Ni
node i
r(Ni ) = set of nodes that are directly linked
to node i
x21 + x2 2 +...+ X21 < b2 ;
r(Ni
r (Ni)= rF...
xN1
)
R(N i) = set of nodes that are accessible to
node i by a path
R(N ) = F(N.)Ui
i
2
0
(l-xiM)xkMaiMk<Ti
~ifVk
iatk
sk royed
kj = the cost of locating a copy of a file
j at the kth node
T.. = the maximum allowable query traffic
time of the jt
file to the i
node
tmoftejnodeIfflto
= the expected time for the i
node to
query th
th
query
setofnej
the
file
from the k
nodea
= the set of node indexes representing a
th
i
we now examine the last term in the
minimization, we can simplify the expression.
The expected value may be brought inside the
summation. Since the importance of the commander i and the importance of file
2. to ccm
mander i are not probabilistic, we can simply
take the expected value of the accessibility.
However, the expected value of the accessibility is simply the probability that commander
i can access file £ given the allocation of
redundant copies of file Z in the network.
We have:
FORMULATION
Sinceis xi
a zero-one variable, the
sum over all nodes i must be equal to the
number of redundant copies of file L.
Therefore we have:
rQ
(4.1)
IZ (R(N
i )} = S.A (U)
(4.2)
E
x.I
=
M
at node k which was requested by node j,
where each node k is a node that has file Z.
The second term denotes the cost of querying
file j at node k which was requested by node
j, where node k is the closest node containing
file Z. The third term represents the cost
of storing file Z at node k. The last term
denotes the cost associated with the expected
accessibility of file £ to the commander at
node i weighted by the importance of commander i.
mander i.
The first set of constraints state that
number of files stored at any node must
be less than the capacity at each node. The
second constraint states that the expected
time to retrieve a query is less than a
certain threshold quantity. The last constraint states that all the zero-one variables
.~ ..
= the volume of update traffic emanating
1,ji
- the volume of update traffic emanating
from node j for file 1
i
of
from
cthe
nodet ofor fnil
jk
the cost of a unit of communication
4.
bN
x > 0
where the first term in the minimization
corresponds to the cost of updating file l
the number of redundant copies of file
j stored in the system
X.. = the volume of query traffic emanating
1
31
from node j for file
i
<
i2)k2ai2k- i2
r.
a.i
xNM
(N
i ) y ...
if 3k s.t. j at N eR(N )
if 3 s.t. kj at Nk R(N )i
if 3k s.t. j at NkeR(Ni)
0
+
(1-Xil)xklailkTil;
8. = the value of file j to commander i
..
j1
P. .(I) = the probability that file j is ac]3
cessible to the commander at node
i given an assignment I
A.(i)=
xN2
and
the importance of commander i
(1
+
i
(
1
R
))|
A (7 )
a
E
We defineiE
Zi
=
which denotes the accessibility of file L to
the commander at node i weighted by the importance of file Z to the commander at node i.
The initial formulation is as follows:
M
m
M4
= min
m in Xdk+
j=l keI Z
L
akxk-E
I
iP'(I
)
it
l
(4.4)
P1i,(I)
-
1
if Pr(3k s.t.
at NkR(Ni))
0
if Pr(3k x.t.
eat NkeR(Ni))
0
if Pr(Vk s.t.
aat Nk,Nk
E a.I(R(N
k=l
k=l
destroyed)
(4.5)
such that
X11 + x12 +
Li
E
jl kl
N
N
Ai(
A ()
where
N
E
E
i=i
x 1...+
bl;
Where Pi (It) for one file Z is by definition:
N
iQ(
N
j=l
N
min
k=l
kfj
N
N
+ h
h=l
M
l (l-xk)x.P.
1
jj=l
j>k
(1-k)x.xh[pj .+(l-p.j)Pih.
.=l1
k=l
k#j
kfh
NN
I
=l
min
M
:I
I
+ a=l
.
b=l
b>a
N
n
"
j=l
j>h
ksb
j>a
kfa
(l-x .)x
]
N
n
I
I x P.
j=l m=l m
mfj
+ (-1)
ifk
M
C(I)
min
I
I
=l
(4.11)
min C(I)
=
rI Xm Pim
l m
5.
is the probability of accessibil-
THEOREMS
First lets look at allocating one file
Z optimally by placing redundant copies at
different nodes, so without loss of genrality
let I=I
.
Theorem I:
If
I=pX
for j=l,2,...,n then
J
an r-node assignment cannot be less costly
than the optimal one-node assignment if
M
min I
C(I)=
I Q.=l
N
L
j~
1)
N
and
jE
Ik ijk jk
k~ 1
k+min
ke
jd jk
k~lk~I~
2)
N
(I
(4.8)
such that
M ..
.
ij b.;
j= 1
N
(r1)
+ N min)d
j
r
jk
j=l k
r
k=2
1
1
pr k=2
Theorem I states that if we have allocated r-l copies of a file and the two
inequalities hold, then by allocating the
rth file, our total cost will not be less
than just allocating one file optimally.
This will allow us to reduce the possible
space in which the integer program
<
li<N-
(l-x. j)x.ak<T..;
13)1.
jk))X1i )
13
+(
l
it
i
(l+0)
jr
+
:iaaPi
r
(Pr-Z-1)(l+p)
jpr
k
L
-
(5.1)
p> 1
- r-l
:z LJ
N
+I o k
k
k=l
l<j<M
L
N
1]
I
I
£Z~
l<i<N
Lets now try to minC(IQ) for a particular Z. The following of theorems will set
bounds on how to allocate the files and
determine when not to allocate files.
ity between nodes i and j.
Substituting this back into the initial
formulation we now have:
M
(4.10)
If we remove the constraints, we are minimizing over disjoint sets, so we have
(4.7)
min
iiPiiz]
<T;
L
M
j=l k=l m=l
k>j mfj
mfk
N
N
where P..
+
jk
l
;
b
a.
L]
~N
N
d
j
(4.6)
j·i
I
Zp
x .<b.
i~~j
-
I
which simplifies to the following:
N+lN
i=
kI
M
+
+ (-1)
L
1 -
X
kd +min
such that
k=l
k#j
j>b
N+1
i
kei
x [Pij
(1-P j)P +(l(1-P.)(l1-P )Pi '']
13
ih ig
k ij
ijih
+()
N
I
ak
I
kgh
=
=
j=l kII
+
NN
C(I)
ifk
lij<M
-solution
must search. Now we can exclude all allocations with more than r-l files from
possible file allocations before execution
of the integer program.
X>O
We know that
N
I
k=l
(anything)xgk=
:
keIZ
(anything)
(4.9)
So substituting into the previous formulation we have:
Theorem II: If for some integer r<n,
r-l
and
N
2)
r
p<(pr-X-l)(l+p)
d + (l+p)
E 6k
r
j=lj jl
r
k=2 k
k=2
(l+p)
N
I mn
p
j=1 k
(r-l)
a
djk
1
+ _
r
k
E
k=2
(5.4)
then any r-node file assignment is more costly than an optimal one-node assignment.
Theorem II states that if we have allocated r-l copies of a file and certain
conditions hold, then by allocating the rth
file, our total cost will be more than allocating one file optimally. This will allow
us to reduce the possible solution space in
which the integer program must search. Now
we can exclude all allocations with more
than r-l files from possible file allocations
before execution of the integer program.
Define uk as follows:
the sequence of costs encountered along any
one of these paths decrease monotonically.
Thus in order to find the optimal allocaticn
policy, it is sufficient to follow every
path of the cost graph until the cost increases and no further. This will given a
locally optimum allocation of which the
global optimum is one of them.
This allow us to reduce the solution
space of the integer program. Once we find
a local optimum then we know that any more
file allocation is not required. Hence the
integer program will not have to search for
solutions in that part of the solution space.
Theorem V:
include site i
where
k + ¥k +
=
_-
Y
k
=
k
I
j=-l
-(IU
(5.12)
N
Zi =
dj
if
i min d.. >Z.,
ji
where
N
Uk =
All optimal allocations will
d.
+ Yi +
i
(5.5)
(5.13)
j=
Theorem VI:
No optimal allocation including
more than one site will include site i if
()(5.6)
)
the following is true:
Then the cost function for any given assignment I is given by:
N
C(I) =
uk + I X min dk .
]
keI
j=l
keI
j
N
Zi
(5.7)
d>
X (maxd -d i.
j=l
k
jk
ji
(5.14)
Theorem V states that if the cost of*
Theorem III:
If
having a local file copy is smaller than
the smallest possible cost of sending
then a local copy should
queries elsewhere,
(5.8)
for k=1,2
C(I)< C(I-[kl)
C-I)<for k~l,2
'(I-tk])
(5.8)unquestionably be included in the optimal
allocation. This will require certain nodes
to have files allocated there. Therefore
C(I~Ck])< C(I~[1,2])
for k=1,2 (5.9)
the solution space required to be searched
by the integer program will be reduced.
Theorem III states that our cost graph if
The integer program can ignore any possible
the given vertex has a cost less than the
file allocations that excludes the files
cost of two vertices leading to it, then
that are unquestionably allocated by
the cost of the predecessor of the two
Theorem V.
vertices is greater than either of the two
Theorem VI states the other extreme
vertices.
of Theorem V.
If it costs more to maintain
a local copy than we could possible save by
Theorem IV:
Given an index set XC I, conhaving one, then we do not want one. This
taining r elements with the following
theorem will allow the integer program to
ignore allocating files in locations that
property:
are definitely too costly and therefore
C(I)< C(I[xl])
for each xCX
(S.10)
reduce the possible solution space for the
C-I)<
C-Clix])
fr each xX
(5.10)
integer programming solution. The integer
(1
(r)
(2)
program can ignore file allocations which
'
Then for every sequence R (1),R 2) ,..,R
include file allocations that are ruled out
which are subsets of X, such that R(k) has
k elements and R
is true:
by Theorem VI.
Define the following:
CR(+l), the following
m. =
1
C(l)<C(I-R (1))<C(I-R(2))<C(I-R (3))<.. C(I-R(r))
(5.11)
Theorem IV states that if a given vertex has a cost less than the cost of any
vertex along the path leading to it, then
. min d..,
i
1)
(5.15)
and
M. =
I X. (max d. -d..)
.
(5.16)
j1
Then for each i the real line is partitioned by mi and Mi into three regions.
If Z. falls into region I then it should
unquestionably be included.
If Z. falls
into region III it will be excluded unless
all Z. fall in region III, then just include
If Z. falls in region II
the largest one.
then it must be further considered.
These theorems are useful because now
the region in which the integer program
must search for solutions is reduced. We
can force files to be allocated in region I
and not be allocated in region III.
Theorem X:
I"' =IU i].
N
N
[N
xk Z
[N N Y
+
i
k=
i=l k=l
M
Z=i
N
L
l<k<N
ksi
(5.17)
(5.18)
file allocation for this special type of
network.
Given a node k and a file j,
Theorem VIII:
then to store a copy of j in k leads to a
reduction of the overall costs if the following holds:
N
+kj
,.
,
>
+ a+cj.
_Y
(5.19)
i=l
For allocating a new copy later, the
theorem states that if allocation of a new
copy leads to a cost decrement for the host
node which is greater than the overcost due
to the necessity of updating the additional
solution to the file allocation problem
(5.23)
I'=IU[k] and I"=I'U[i].If site i satisfies
(5.20)
for some site k in the network, then
C(S") >C(I').
Theorem IX states that if site i is
sufficiently costly, then by adding site i
to an allocation which already includes k
increases the total cost.
A site i cannot be included in
].
(5.24)
Theorem XI states that instead of determining that no more than one of some group
of geographically close sites can be included
in an optimal solution, Theorem XI states
that certain sites may be excluded from being
optimal allocations by the existence of
better nearby sites. This theorem is useful
because it allows us to reduce the possible
solution space of the integer program. The
integer program can ignore file allocations
that allocate separate copies of the same
file at geographically close nodes.
Theorem IX, theorem X, and theorem XI
are extensions of work done by Grapa and
Belford [6].
Theorem XII:
P. (I) =
The following are equivalent:
N
N
I
n
j=l
k=l
k#j
(1-x)xjPij
(-x )x x
+
h
h
h
j
1
j
k1
j>h k~j
kh
N
Define the following allocations:
N
> I X max[(djk-dji),0],
jk
j=l
.,O]
ji
N
Zi-Zk >
X.max[djk.
-d
i
k
j=l 3
jk ji
does not require integer programming and
It enables
therefore is not NP complete.
simple calculations to determine file
allocation.
Zi
-d
any optimal allocation if there exists
another site k in the network such that
copy and storage cost, then we should store
the file there.
This theorem is useful because the
Theorem IX:
(5.22)
N
-Z
> I X maxtd
i k j=-l a
1]
kj + kj
(5.21)
Theorem X states that if:
The decision rule for the initial file asif:
signment is xij=l
Xij +ij
),0],
ji
C(I''')>C(I')
Theorem XI:
1
.iiPi
jk
is satisfied, then replacing site k by site
i in an allocation will increase the cost.
X kl +
(IZ
XAk.(l-XkZ)-i
k=l
i=l
E
-d
Xmax[(d
j=l
k
By choosing djk=l,
kl=
and
jk
kl 1
a completely connected network, then the
cost function reduces to the following:
N
If sites i and k satisfy
N
Z -Z >
Theorem VII:
C(IZ)=
Define the following allocations
N
N
+
..
a=l bl
b>a
j
ij
i
ij
ih
N
i
Xk[ij +(-p ij)p ih
j=l k=l
j>h k#j
.j>b
j>a k'b
kjb
kfa
+ (1-p j)(l-Pih)pi.
( 1-ij)
ih )ig
and
x
N=
i
Z Xj iJ
N+1
..
N
N
N
+ -k-l
I
N+ fmj(-)
SJCC (1972).
3. K.P. Eswaran, "Placement of Records in a
File
p - and
..File Allocation in a Computer
Network," Information Processing, North
Holland (1974).
4. H.L. Morgan, J.D. Levin, "Optimal Program
XmPim
m m
and Data Locations in Computer Networks,"
m=kj
k3j
m
N NN
( )NE
x: P ++(l,)
( - 1 )N
(-1)N1 Z
nP X~im
jm .m
M
j=l ml
mj#j
5.
+I
N
N xN+lP
~rxpi
mm iM
m=l
(5.26)
6.
where p.. is the probability of accessibility
1
between Aodes i and j.
This theorem essentially states a
generalization of the well known probability
law:
P(ABC)=P(A)+P(B)+P(C)-P(AB)-P(BC)-P(AC)+P (ABC).
(5.27)
6.
CONCLUSIONS
We have formulated the file allocation
3
problem in a C context where vulnerability
is an issue. The formulation considers:
1. The probability of commander accessing
files;
2. The importance of commanders;
3. The importance of particular files to
particular commanders.
The theorems have provided ways to cut down
on the possible file allocations (solution
space) in which the integer program has to
Therefore, we reduce the amount
search.
of time required to solve for a solution
using integer programming.
We have extended and proved twelve
applicable to the new formulatheorem, all
tion.
3
In the C context, we may not need inger programming to solve for a solution if
we make the following assumptions:
1. Connected network where all nodes are
connected to each other;
2. Cost of communication is same between all
nodes;
is the same.
3. Cost of storing a file
In the area of further research, we plan to
explore the effects of where the data sources
allocation problem.
are located on the file
3
This would be applicable in a C context,
where sensor data may come from only a fixed
The data must also pass
set of nodes.
The location of
through a processing node.
where the processing node should be is also
an optimization problem which can be incorporated into our formulation.
REFERENCES
1. W.W. Chu, "Optimal File Allocation in a
Multiple Computer System." IEEE Transactions on Computer, Vol. C-18, No. 10
(1969).
2. R.G. Casey, "Allocation of Copies of a
File in an Information Network," AFIPS,
7.
CACM, Vol. 20, No.5 (1977).
C.V. Ramamoorthy, B.W. Wah, "The Placement of Relations on a Distributed
Relational Data Base," Proc. 1st. Int.
Conference on Distributed Computing
Systems (1979).
E. Grapa and G.G. Belford, "Some Theorems
to Aid in solving the File Allocation
Problem," CACM, Vol. 20, No. 11,
pp. 878-882 (1977).
R.G. Casey, "Allocation of Copies of a
File in an Information Network," AFIPS,
SJCC (1972).
Download