December, 1982                                                LIDS-P-1265

OPTIMAL FILE ALLOCATION PROBLEMS FOR DISTRIBUTED DATA BASES IN UNRELIABLE COMPUTER NETWORKS*

by

Moses Ma
Michael Athans

Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139

*This research was supported by the Office of Naval Research under contract ONR/N00014-77-C-u6 Z (NR 041-519).

ABSTRACT

This paper deals with the problem of optimally locating files, and their optimum number of redundant copies, in a vulnerable communication network. It is assumed that each node and link of the communication network can fail independently. The optimization problem maximizes the probability that a commander can access the subset of files that he needs while minimizing the network-wide costs related to storage, query and update communication. The problem reduces to a linear zero-one integer programming problem; several theorems that reduce its complexity of solution are presented.

1.1 INTRODUCTION

This paper focuses on the problem of optimal redundant file allocation for a very vulnerable distributed data base system. This file allocation problem differs from previous file allocation problems because it considers the following new items:

1. Vulnerability of the nodes and links due to enemy actions, e.g. jamming
2. The importance of the users
3. The importance of particular files to particular users.

First we shall discuss the motivation for the problem of optimal file allocation in a vulnerable environment. Second we shall explain the problem and its constraints. Third we shall discuss the possible trade-off in costs. Fourth a brief literature survey will be presented. Fifth the actual formulation will be explained. Sixth the various theorems that have been developed for this formulation will be stated and explained in words; however, we do not include the theorem proofs. Seventh the conclusions and suggestions for further research will be presented.

1.2 MOTIVATION

The motivation behind our problem is in the C3 (Command, Control, and Communications) context. In this context we are considering a Naval Battle Group (BG) composed of aircraft carriers, cruisers, destroyers, aircraft, etc. The BG gathers information through its organic sensors. This information must somehow be stored and maintained to the utmost correctness because the BG must coordinate its actions.

The system may be considered as a distributed data base system. The ships and planes can be considered as nodes, and the communication channels between the ships and planes can be considered as links. The data can be considered as files stored in the computers of the ships and planes. Considering the BG as a set of nodes, links and data files, we have defined for ourselves a distributed data base network. Since in warfare ships can be destroyed and communication links jammed, our network is vulnerable. Therefore we must consider how to maintain a consistent and complete data base. If we also consider the individual warfare commanders and the data files they need, the problem becomes more complex. We can also rank the importance of each commander and the importance of each data file to each commander and include this in our optimization problem.

The problem is therefore as follows. We are given the following:

1. A set of M data files.
2. A set of N nodes to store the data files.
3. The probabilities of any node or link being destroyed, from which the probability that any particular commander at one node can access any particular file at another node can be obtained.
4. A set of L commanders.
5. The costs of assigning a particular file to any particular node.
6. The query rates for any particular file emanating from any particular node. The query rate is the rate at which files are requested.
7. The update rates for any particular file emanating from any particular node. The update rate is the rate at which files are updated (changed).

We desire the following:

1. To locate single or multiple copies of the M files at the N nodes such that the files will be accessible to the commanders who need the files.
2. To locate the files at nodes that will provide the least amount of cost. The cost can be related to query communication costs, update communication costs and file storage costs.

1.4 TRADE-OFF OF COSTS

There is a trade-off in costs if we consider the following costs:

1. query communication costs
2. update communication costs
3. storage costs
4. the cost to the BG if a particular commander does not have access to the particular file he desires.

To minimize any one of the four costs we can do the following:

1. To reduce the query communication costs, we can store more redundant copies of each file so that each query can find its file at a lower communication cost. This is true since we shall assume that each query goes to the nearest node containing the file.
2. To reduce the update communication costs, we can store fewer redundant copies of each file so that each update will update fewer files and incur less communication costs. This is true since we assume each update goes to all the nodes containing the file.
3. To reduce the storage costs, we can store fewer redundant copies of each file so the cost will decrease.
4. To reduce the cost of non-accessibility of particular files to particular commanders, we can store more redundant copies of that particular file so that it has a higher probability of being able to be accessed by the particular commander.

The bottom line is: we cannot both increase and decrease the number of redundant copies of a particular file. We would like to find the optimal number of redundant copies for all the files and where to store them.

2. LITERATURE SURVEY

The file allocation problem was first investigated by Chu [1]; a global optimization was considered, consisting in obtaining the minimum overall operating costs subject to two kinds of constraints: first, the expected time to access each file had to be less than a given delay, and secondly, the amount of storage needed at each computer had not to exceed the available storage capacity. The number of copies of each file was assumed to be fixed. A generalized model was defined, in which storage and transmission costs were associated to file allocations; channel queues were modeled in order to introduce the constraint on the delay. The resulting linearized integer program was characterized by a very great number of variables even for applications of limited dimensions; its solution was extremely hard from a conceptual viewpoint.

Casey [2],[7] considered the problem of allocating single files separately, but the number of copies of each file was not assumed to be fixed. Communication costs and storage costs of allocations were analyzed in order to determine the optimal set of nodes on which the file was to be allocated. The difference between retrieval and update transactions was stressed; while retrieval transactions are routed to only one copy of the file, update transactions are routed to all the copies, in order to preserve consistency of redundant information. Under the assumption of taking equal cost rates for retrieval and updates, theorems were given for limiting the number of replicated copies of the file on the basis of the update/query ratio; obviously the convenience of taking replicated copies decreases while the update/query ratio increases. Although the file allocation problem was analyzed for each file separately, Eswaran [3] proved that Casey's formulation was NP complete and, therefore, suggested to investigate heuristic approaches.

Morgan and Levin [4] examined both the allocation of files and transactions within a generalized, ARPA-like network. They adopted the user's viewpoint, assuming to be under the jurisdiction of a network management providing services at the market price. Because of this characterization of the application environment, storage capacity constraints were not included; the provision of sufficient storage was considered a task of the network management. Therefore, by introducing some other simplifying assumptions, the authors demonstrated that the multiple file allocation problem could be decomposed into independent (single) file allocation problems; they also developed a heuristic solution technique.

Finally two contributions to the file allocation problem have been very recently presented. Ramamoorthy and Wah [5] analyzed a relational Distributed Data Base; they observed that the general approach of query processing optimization consists in the minimization of communication costs. These communication costs are mostly due to data moves which are necessary for providing the logical correlation, expressed by the query, between files stored on different nodes. A logical operation which is particularly critical is the join operation between remote files; a join between two files can be performed only if the two files are co-located at the same node. Therefore the authors developed a model in which redundant files are introduced in order to avoid distributed joins, on the basis of the frequency of queries.

3. NOTATION

$b_i$ = the available memory size of the $i$th computer
$x_{ij}$ = 1 if the file $j$ is stored at node $i$, 0 otherwise
$N_i$ = node $i$
$\Gamma(N_i)$ = set of nodes that are directly linked to node $i$
xN1 ) R(N i) = set of nodes that are accessible to node i by a path R(N ) = F(N.)Ui i 2 0 (l-xiM)xkMaiMk<Ti ~ifVk iatk sk royed kj = the cost of locating a copy of a file j at the kth node T.. = the maximum allowable query traffic time of the jt file to the i node tmoftejnodeIfflto = the expected time for the i node to query th th query setofnej the file from the k nodea = the set of node indexes representing a th i we now examine the last term in the minimization, we can simplify the expression. The expected value may be brought inside the summation. Since the importance of the commander i and the importance of file 2. to ccm mander i are not probabilistic, we can simply take the expected value of the accessibility. However, the expected value of the accessibility is simply the probability that commander i can access file £ given the allocation of redundant copies of file Z in the network. We have: FORMULATION Sinceis xi a zero-one variable, the sum over all nodes i must be equal to the number of redundant copies of file L. Therefore we have: rQ (4.1) IZ (R(N i )} = S.A (U) (4.2) E x.I = M at node k which was requested by node j, where each node k is a node that has file Z. The second term denotes the cost of querying file j at node k which was requested by node j, where node k is the closest node containing file Z. The third term represents the cost of storing file Z at node k. The last term denotes the cost associated with the expected accessibility of file £ to the commander at node i weighted by the importance of commander i. mander i. The first set of constraints state that number of files stored at any node must be less than the capacity at each node. The second constraint states that the expected time to retrieve a query is less than a certain threshold quantity. The last constraint states that all the zero-one variables .~ .. 
= the volume of update traffic emanating 1,ji - the volume of update traffic emanating from node j for file 1 i of from cthe nodet ofor fnil jk the cost of a unit of communication 4. bN x > 0 where the first term in the minimization corresponds to the cost of updating file l the number of redundant copies of file j stored in the system X.. = the volume of query traffic emanating 1 31 from node j for file i < i2)k2ai2k- i2 r. a.i xNM (N i ) y ... if 3k s.t. j at N eR(N ) if 3 s.t. kj at Nk R(N )i if 3k s.t. j at NkeR(Ni) 0 + (1-Xil)xklailkTil; 8. = the value of file j to commander i .. j1 P. .(I) = the probability that file j is ac]3 cessible to the commander at node i given an assignment I A.(i)= xN2 and the importance of commander i (1 + i ( 1 R ))| A (7 ) a E We defineiE Zi = which denotes the accessibility of file L to the commander at node i weighted by the importance of file Z to the commander at node i. The initial formulation is as follows: M m M4 = min m in Xdk+ j=l keI Z L akxk-E I iP'(I ) it l (4.4) P1i,(I) - 1 if Pr(3k s.t. at NkR(Ni)) 0 if Pr(3k x.t. eat NkeR(Ni)) 0 if Pr(Vk s.t. aat Nk,Nk E a.I(R(N k=l k=l destroyed) (4.5) such that X11 + x12 + Li E jl kl N N Ai( A () where N E E i=i x 1...+ bl; Where Pi (It) for one file Z is by definition: N iQ( N j=l N min k=l kfj N N + h h=l M l (l-xk)x.P. 1 jj=l j>k (1-k)x.xh[pj .+(l-p.j)Pih. .=l1 k=l k#j kfh NN I =l min M :I I + a=l . b=l b>a N n " j=l j>h ksb j>a kfa (l-x .)x ] N n I I x P. j=l m=l m mfj + (-1) ifk M C(I) min I I =l (4.11) min C(I) = rI Xm Pim l m 5. is the probability of accessibil- THEOREMS First lets look at allocating one file Z optimally by placing redundant copies at different nodes, so without loss of genrality let I=I . Theorem I: If I=pX for j=l,2,...,n then J an r-node assignment cannot be less costly than the optimal one-node assignment if M min I C(I)= I Q.=l N L j~ 1) N and jE Ik ijk jk k~ 1 k+min ke jd jk k~lk~I~ 2) N (I (4.8) such that M .. . 
ij b.; j= 1 N (r1) + N min)d j r jk j=l k r k=2 1 1 pr k=2 Theorem I states that if we have allocated r-l copies of a file and the two inequalities hold, then by allocating the rth file, our total cost will not be less than just allocating one file optimally. This will allow us to reduce the possible space in which the integer program < li<N- (l-x. j)x.ak<T..; 13)1. jk))X1i ) 13 +( l it i (l+0) jr + :iaaPi r (Pr-Z-1)(l+p) jpr k L - (5.1) p> 1 - r-l :z LJ N +I o k k k=l l<j<M L N 1] I I £Z~ l<i<N Lets now try to minC(IQ) for a particular Z. The following of theorems will set bounds on how to allocate the files and determine when not to allocate files. ity between nodes i and j. Substituting this back into the initial formulation we now have: M (4.10) If we remove the constraints, we are minimizing over disjoint sets, so we have (4.7) min iiPiiz] <T; L M j=l k=l m=l k>j mfj mfk N N where P.. + jk l ; b a. L] ~N N d j (4.6) j·i I Zp x .<b. i~~j - I which simplifies to the following: N+lN i= kI M + + (-1) L 1 - X kd +min such that k=l k#j j>b N+1 i kei x [Pij (1-P j)P +(l(1-P.)(l1-P )Pi ''] 13 ih ig k ij ijih +() N I ak I kgh = = j=l kII + NN C(I) ifk lij<M -solution must search. Now we can exclude all allocations with more than r-l files from possible file allocations before execution of the integer program. X>O We know that N I k=l (anything)xgk= : keIZ (anything) (4.9) So substituting into the previous formulation we have: Theorem II: If for some integer r<n, r-l and N 2) r p<(pr-X-l)(l+p) d + (l+p) E 6k r j=lj jl r k=2 k k=2 (l+p) N I mn p j=1 k (r-l) a djk 1 + _ r k E k=2 (5.4) then any r-node file assignment is more costly than an optimal one-node assignment. Theorem II states that if we have allocated r-l copies of a file and certain conditions hold, then by allocating the rth file, our total cost will be more than allocating one file optimally. This will allow us to reduce the possible solution space in which the integer program must search. 
Now we can exclude all allocations with more than r-l files from possible file allocations before execution of the integer program. Define uk as follows: the sequence of costs encountered along any one of these paths decrease monotonically. Thus in order to find the optimal allocaticn policy, it is sufficient to follow every path of the cost graph until the cost increases and no further. This will given a locally optimum allocation of which the global optimum is one of them. This allow us to reduce the solution space of the integer program. Once we find a local optimum then we know that any more file allocation is not required. Hence the integer program will not have to search for solutions in that part of the solution space. Theorem V: include site i where k + ¥k + = _- Y k = k I j=-l -(IU (5.12) N Zi = dj if i min d.. >Z., ji where N Uk = All optimal allocations will d. + Yi + i (5.5) (5.13) j= Theorem VI: No optimal allocation including more than one site will include site i if ()(5.6) ) the following is true: Then the cost function for any given assignment I is given by: N C(I) = uk + I X min dk . ] keI j=l keI j N Zi (5.7) d> X (maxd -d i. j=l k jk ji (5.14) Theorem V states that if the cost of* Theorem III: If having a local file copy is smaller than the smallest possible cost of sending then a local copy should queries elsewhere, (5.8) for k=1,2 C(I)< C(I-[kl) C-I)<for k~l,2 '(I-tk]) (5.8)unquestionably be included in the optimal allocation. This will require certain nodes to have files allocated there. Therefore C(I~Ck])< C(I~[1,2]) for k=1,2 (5.9) the solution space required to be searched by the integer program will be reduced. Theorem III states that our cost graph if The integer program can ignore any possible the given vertex has a cost less than the file allocations that excludes the files cost of two vertices leading to it, then that are unquestionably allocated by the cost of the predecessor of the two Theorem V. 
vertices is greater than either of the two Theorem VI states the other extreme vertices. of Theorem V. If it costs more to maintain a local copy than we could possible save by Theorem IV: Given an index set XC I, conhaving one, then we do not want one. This taining r elements with the following theorem will allow the integer program to ignore allocating files in locations that property: are definitely too costly and therefore C(I)< C(I[xl]) for each xCX (S.10) reduce the possible solution space for the C-I)< C-Clix]) fr each xX (5.10) integer programming solution. The integer (1 (r) (2) program can ignore file allocations which ' Then for every sequence R (1),R 2) ,..,R include file allocations that are ruled out which are subsets of X, such that R(k) has k elements and R is true: by Theorem VI. Define the following: CR(+l), the following m. = 1 C(l)<C(I-R (1))<C(I-R(2))<C(I-R (3))<.. C(I-R(r)) (5.11) Theorem IV states that if a given vertex has a cost less than the cost of any vertex along the path leading to it, then . min d.., i 1) (5.15) and M. = I X. (max d. -d..) . (5.16) j1 Then for each i the real line is partitioned by mi and Mi into three regions. If Z. falls into region I then it should unquestionably be included. If Z. falls into region III it will be excluded unless all Z. fall in region III, then just include If Z. falls in region II the largest one. then it must be further considered. These theorems are useful because now the region in which the integer program must search for solutions is reduced. We can force files to be allocated in region I and not be allocated in region III. Theorem X: I"' =IU i]. N N [N xk Z [N N Y + i k= i=l k=l M Z=i N L l<k<N ksi (5.17) (5.18) file allocation for this special type of network. Given a node k and a file j, Theorem VIII: then to store a copy of j in k leads to a reduction of the overall costs if the following holds: N +kj ,. , > + a+cj. 
_Y (5.19) i=l For allocating a new copy later, the theorem states that if allocation of a new copy leads to a cost decrement for the host node which is greater than the overcost due to the necessity of updating the additional solution to the file allocation problem (5.23) I'=IU[k] and I"=I'U[i].If site i satisfies (5.20) for some site k in the network, then C(S") >C(I'). Theorem IX states that if site i is sufficiently costly, then by adding site i to an allocation which already includes k increases the total cost. A site i cannot be included in ]. (5.24) Theorem XI states that instead of determining that no more than one of some group of geographically close sites can be included in an optimal solution, Theorem XI states that certain sites may be excluded from being optimal allocations by the existence of better nearby sites. This theorem is useful because it allows us to reduce the possible solution space of the integer program. The integer program can ignore file allocations that allocate separate copies of the same file at geographically close nodes. Theorem IX, theorem X, and theorem XI are extensions of work done by Grapa and Belford [6]. Theorem XII: P. (I) = The following are equivalent: N N I n j=l k=l k#j (1-x)xjPij (-x )x x + h h h j 1 j k1 j>h k~j kh N Define the following allocations: N > I X max[(djk-dji),0], jk j=l .,O] ji N Zi-Zk > X.max[djk. -d i k j=l 3 jk ji does not require integer programming and It enables therefore is not NP complete. simple calculations to determine file allocation. Zi -d any optimal allocation if there exists another site k in the network such that copy and storage cost, then we should store the file there. 
This theorem is useful because the Theorem IX: (5.22) N -Z > I X maxtd i k j=-l a 1] kj + kj (5.21) Theorem X states that if: The decision rule for the initial file asif: signment is xij=l Xij +ij ),0], ji C(I''')>C(I') Theorem XI: 1 .iiPi jk is satisfied, then replacing site k by site i in an allocation will increase the cost. X kl + (IZ XAk.(l-XkZ)-i k=l i=l E -d Xmax[(d j=l k By choosing djk=l, kl= and jk kl 1 a completely connected network, then the cost function reduces to the following: N If sites i and k satisfy N Z -Z > Theorem VII: C(IZ)= Define the following allocations N N + .. a=l bl b>a j ij i ij ih N i Xk[ij +(-p ij)p ih j=l k=l j>h k#j .j>b j>a k'b kjb kfa + (1-p j)(l-Pih)pi. ( 1-ij) ih )ig and x N= i Z Xj iJ N+1 .. N N N + -k-l I N+ fmj(-) SJCC (1972). 3. K.P. Eswaran, "Placement of Records in a File p - and ..File Allocation in a Computer Network," Information Processing, North Holland (1974). 4. H.L. Morgan, J.D. Levin, "Optimal Program XmPim m m and Data Locations in Computer Networks," m=kj k3j m N NN ( )NE x: P ++(l,) ( - 1 )N (-1)N1 Z nP X~im jm .m M j=l ml mj#j 5. +I N N xN+lP ~rxpi mm iM m=l (5.26) 6. where p.. is the probability of accessibility 1 between Aodes i and j. This theorem essentially states a generalization of the well known probability law: P(ABC)=P(A)+P(B)+P(C)-P(AB)-P(BC)-P(AC)+P (ABC). (5.27) 6. CONCLUSIONS We have formulated the file allocation 3 problem in a C context where vulnerability is an issue. The formulation considers: 1. The probability of commander accessing files; 2. The importance of commanders; 3. The importance of particular files to particular commanders. The theorems have provided ways to cut down on the possible file allocations (solution space) in which the integer program has to Therefore, we reduce the amount search. of time required to solve for a solution using integer programming. We have extended and proved twelve applicable to the new formulatheorem, all tion. 
3 In the C context, we may not need inger programming to solve for a solution if we make the following assumptions: 1. Connected network where all nodes are connected to each other; 2. Cost of communication is same between all nodes; is the same. 3. Cost of storing a file In the area of further research, we plan to explore the effects of where the data sources allocation problem. are located on the file 3 This would be applicable in a C context, where sensor data may come from only a fixed The data must also pass set of nodes. The location of through a processing node. where the processing node should be is also an optimization problem which can be incorporated into our formulation. REFERENCES 1. W.W. Chu, "Optimal File Allocation in a Multiple Computer System." IEEE Transactions on Computer, Vol. C-18, No. 10 (1969). 2. R.G. Casey, "Allocation of Copies of a File in an Information Network," AFIPS, 7. CACM, Vol. 20, No.5 (1977). C.V. Ramamoorthy, B.W. Wah, "The Placement of Relations on a Distributed Relational Data Base," Proc. 1st. Int. Conference on Distributed Computing Systems (1979). E. Grapa and G.G. Belford, "Some Theorems to Aid in solving the File Allocation Problem," CACM, Vol. 20, No. 11, pp. 878-882 (1977). R.G. Casey, "Allocation of Copies of a File in an Information Network," AFIPS, SJCC (1972).
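
The screening rules of Theorems V and VI and the cost trade-off of Section 1.4 can be illustrated for a single file with a small Python sketch. Everything specific here is an assumption made for illustration, not part of the paper: the four-node line network, the rates and costs, the function names, and in particular the fixed site cost Z_k, which is taken to be storage plus update traffic only (the accessibility term of the formulation is left out). The sketch also assumes d_ii = 0, so that a local copy answers local queries at no communication cost.

```python
from itertools import combinations


def site_costs(sigma, psi, d):
    """Assumed fixed cost Z_k of a copy at node k: storage plus update traffic."""
    n = len(sigma)
    return [sigma[k] + sum(psi[j] * d[j][k] for j in range(n)) for k in range(n)]


def total_cost(alloc, z, lam, d):
    """C(I): fixed costs of the chosen sites plus queries sent to the nearest copy."""
    fixed = sum(z[k] for k in alloc)
    query = sum(lam[j] * min(d[j][k] for k in alloc) for j in range(len(lam)))
    return fixed + query


def classify_sites(z, lam, d):
    """Region I: Z_i < m_i (always include); III: Z_i > M_i (exclude); else II."""
    n = len(lam)
    regions = []
    for i in range(n):
        m_i = lam[i] * min(d[i][j] for j in range(n) if j != i)
        big_m_i = sum(lam[j] * (max(d[j]) - d[j][i]) for j in range(n))
        if z[i] < m_i:
            regions.append("I")
        elif z[i] > big_m_i:
            regions.append("III")
        else:
            regions.append("II")
    return regions


def brute_force(z, lam, d):
    """Exhaustive search over all non-empty allocations (small n only)."""
    n = len(lam)
    candidates = (c for r in range(1, n + 1) for c in combinations(range(n), r))
    return min(candidates, key=lambda c: total_cost(c, z, lam, d))


# Hypothetical data: four nodes on a line, unit link cost, heavy query load at node 0.
d = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
lam = [10, 1, 1, 1]    # query rates
psi = [1, 1, 1, 1]     # update rates
sigma = [1, 2, 2, 20]  # storage costs

z = site_costs(sigma, psi, d)
regions = classify_sites(z, lam, d)
best = brute_force(z, lam, d)
print(z, regions, best, total_cost(best, z, lam, d))
```

With these numbers node 0 falls in region I and node 3 in region III, so an exhaustive search would only need to examine the allocations that contain node 0 and leave out node 3; the brute-force result is consistent with that screening.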