E R S

advertisement

E

NHANCED

R

EPLICA

S

ELECTION

T

ECHNIQUE FOR

B

INDING

R

EPLICA

S

ITES IN

D

ATA

G

RIDS

Mahdi S. Almhanna

#1

, Rafah M. Almuttairi

#2

#1

Institute of Computer & Systems Sciences, CMJ University, Meghalaya, India.

#2

University of Babylon, Babylon, Iraq.

#1 mehdisaleh2004@yahoo.com, #2 rafahmohammed@gmail.com

Abstract

Data Grid technology is developed to share data across many sites in different geographical locations primarily to improve data access and transfer performance. When different sites hold a copy of the files, there are significant benefits realized when selecting the best set of sites that cooperate to speed up the file transfer process. In this paper, we describe a new optimization technique used to reduce total file transferring time by applying a Pincer-Search algorithm of Data mining approach to discover associated distributed replicas sites to share the transferring file process.

Keywords

— Replica Selection Technique, Data Grids,

Association rules. association rules technique of Data Mining approach [8, 16].

We proved that Pincer-Search algorithm is better than Apriori algorithm in our previous work [18]. The reason is that

Pincer-Search consumes less searching time of getting a stable set of replicas. The current work is to extend the selection strategy to enhance the selection performance of our previous work [18]. The improvement comes when additional factors are added along with the latency in the selection criterion. The additional factors are Bandwidth [10], load of CPU and availability of the Memory.

I.

I NTRODUCTION

Grid computing emerges from the need to integrate collection of distributed computing resources to offer performance unattainable by any single machine. Grid technology facilitates data sharing across many organizations in different geographical locations. Data replication is an excellent technique to move and cache data close to users.

Replication is a solution for many grid-based applications such as climatic data analysis and Grid Physics Network [3] which both require responsive navigation and manipulation of large-scale datasets. Moreover, if multiple replicas exist, a replica management service is required to discover the available replicas and select the best replica that matches the user's quality of service requirements.

Our work is about improving the selection mechanism of the management system in Data Grid architecture [2]. Our selection strategy differs from the others [9]. Our proposed strategy selects group of replica sites. The selected sites have similar characteristics in terms of stabilizing the network conditions. These sites concurrently work to send different parts of a large file(s) or different small files. That is done using File Transport Service such as GridFTP or UDT [12,

13].

In our previous works [19, 20, 21, 22, 23, 24, 25], we enhanced the selection strategy using Rough Set theory [21,

22, 23, 25] and association rules technique of Data Mining approach. Apriori algorithm is an association rules technique which is used in our previous work [19, 20, 24] to select set of associated sites [5]. Pincer-Search algorithm is another

II.

E NHANCED R EPLICA S ELECTION T ECHNIQUE

We proposed a replica selection cost model in which we define a formula for calculating the weighting. The score based on the states of various replica server devices has to be collected. The Bandwidth and the latency of links are the most significant factors, affect directly on data transfer time. The other two factors CPU and I/O slightly affect the performance data transfer. In our work and according to the first three factors [1], a score function is defined as the following formula:

Score i

= p i

BW

× w i

BW

+ p i

CPU

× w i

CPU

+ p i

I/O

× w i

I/O

And W i

BW

+ W i

CPU

+ W i

I/O

= 1 (1)

Score

p i

: It represents the replica selection cost model for server i where 1≤ i ≤ n i

BW

: It represents the percentage of Bandwidth available from server i to the computing site; the current Bandwidth divided by the highest theoretical bandwidth w i

BW

: It is the network and Bandwidth ratio of replica server i defined by the administrator of the p organization i

CPU

: It is the percentage CPU idle states of replica server i w i

CPU

: It is the CPU load ratio of server i defined by the p administrator of the Data Grid organization i

I/O

: It is the percentage of memory free space of server i w i

I/O

: It is the memory free space ratio defined by the administrator of the Data Grid organization

Fig. 1. Notifications

The influencing factors which are: w i

BW

, w i

CPU

, and w i

I/O represent weights of; Bandwidth, CPU, and I/O are given in

Equation 1 . These weights can be determined by the administrator of the Data Grid organization. According to different attributes of storage systems in data Grid node. Fig.1 contains the notification of Equation1 .

III.

E NHANCED E FFICIENT S ET ALGORITHM

In our previous work, Efficient Set Technique EST algorithm is used to get set of replica sites working concurrently with minimum cost to transfer the files. In this section, the enhanced algorithm of EST is explained.

Algorithm: Enhanced NHF-Mapping Algorithm

Input:

(1) NHF

(2) l : predefined window size

(3) j : location of the sliding window

Output: LHF

Begin

1: Calculate the Mean of STT

MSTT

i , j

 l

1

 k

 j j

(

STT

i , j

) /

l

i,j

:

2: Calculate the Standard Deviation of STT l

1

 j i,j

:

STDEV

i , j k

 j

STT

i , j

(

STT

i , k

MSTT

i , j

) 2 /

l

3: Qi,j = STDEVi,j / MSTTi,j * 100

4: Find a Coefficient Variation of Replica provider, j using:

M

AV j

 i

1

( Q i , j

) / M /* AV j

is the average of Q i,j

*/

5: Classify links into stable and unstable using the condition:

IF (AV j

< Q i,j

) Then ( LV i,j

= 0 ) Otherwise,

(LV i,j

= 1) /* LV represents a mapped Boolean Value of a

STT */

6: Calculate score of the replicas sites using Equation 1

7: Calculate the summation of all servers score

Sum

 j

M 

1

Score

j

8: Calculate the average of scores, Avg = Sum / M

9: Compare the Score i with the average Avg ,

If Score i ≥ Avg Then , Score i=1,

Otherwise Score i= 0

10: Use logical And Gate to find LV

LV i

= LV i i

^ Score i for i=1 to M (M= number of replicas server sites)

11: LHF := LV;

12: End Enhanced NHF-Mapping

Fig. 2. Pseudo steps of the Enhanced NHF-Mapping Algorithm

IV.

S IMULATION INPUTS

The proposed approach of Enhanced Efficient Set approach is tested using:

1The Network History File NHF Real of real Grid environment called PRAGMA Grid [4, 7]. Uohyd nodes represent a Computing Site where the files should be moved to. The rest of sites represent the replicas where the required files are stored see Fig.3.

Ipref [15] or Network Monitoring Services [11] are used to get the history of Round trip time between

Uohyd and other replica sites [4]. In our simulation

STT= RTT/2

2The Confidence c and Support s values are selected with respect to the nature of Network History File

NHF data sets

Fig. 3. PRAGMA Grid, 28 institutions in 17 regions [4]

To compare between our previous selection technique [18] and the proposed model an OptorSim [6] simulator is used.

OptorSim is a Data Grid simulator designed to allow experiments and evaluations of various selection optimization strategies in Grid environments.

V.

A C OMPARISON BETWEEN EST USING A PRIORI ALGORITHM

AND E NHANCED EST USING P INCER -S EARCH ALGORITHM

In our previous work [19, 20], Apriori algorithm is used as an association rules technique to select set of uncongested replica sites. In our other work [18] and current work we used another association rules algorithm that is called Pincer-

Search algorithm for the following reasons:

1.

Using Apriori algorithm, the number of times of reading the Network History File NHF is equal to the number of replicas in the Maximum Frequent Set

MFS, so the number of reading the NHF is depended on the size of the datasets and the relation between the items

2.

In Pincer-Search algorithm, the searching is Up-

Down. In Up-Down searching method MFS might be given from the first reading to the LHF

3.

The worst case to Pincer-Search algorithm is when the time of getting MFS equals to the time of getting MFS using Apriori algorithm when both read same NHF

For more clarification, assume the number items of

Maximum Frequent Set of NHF using Apriori algorithm is 10 items. And T seconds represents the total time needed to read

NHF once. However, the total time of reading NHF ten times is 10 × T seconds. Using Pincer-Search algorithm there is a chance to get MFS from the first reading to NHF. The best case is the total time to get MFS is equal to T seconds. The worst case is 10 × T . In our simulation we depend on the average time of both cases, so the total time of getting MFS is

(10T +T)/2 = 5.5 T . As shown in Fig. 10, Pincer-Search algorithm reduces the time of searching about 40% as an average, comparing with the priory algorithm. The results show that the performance of using Pincer-Search algorithm as a selection strategy in the optimizer is far better than using

Apriori algorithm as shown in Fig. 7.

VI.

A C OMPARISON OF THE FILE TRANSMISSION TIME OF

REQUESTED FILE USING T RADITIONAL T ECHNIQUES AND

E NHANCED E FFICIENT S ITES T ECHNIQUE EEST

This section is to analyse the results using the traditional technique with all different options and EEST. All configuration parameters are the same for all replica sites. The only difference is the network conditions (Number of Hop counts, BW and RTT).

A) Traditional Method TM (using Hops as criteria to select best replica site) vs. Enhanced Efficient Sites Technique

As shown in Fig. 5-a, the proposed EEST is more efficient than the traditional model. The main reason is, EEST uses set of stable replicas (stable links). So the number of retransmission using EEST is far less than number of retransmission in traditional method. In traditional method the selected replica site is the one which have the least number of

Hop Counts. The selected criterion of the traditional method does not reflect the real picture of the network, especially in the Inter-Grid architecture. For clarification, let us take an example which shown in Fig. 4. In Fig. 4, if the job J

1 is submitted to S

0

. J

1 asks to analysis a file (s). The required file resides in S

1

, S

3

, S

4

, S

14

.Using the Traditional Model with a less number of Hop count, the required file is taken from S

1 or S

14

as the number of Hops (routers) is 1 whereas, the number of Hops between S

0

and S

3

is 2 and between S

0 and S

4 is 3 . File transfer time in case the file is taken from S

1 or S

14 is greater than file transfer time if the file is taken from S

3 as routers ( R

5

, R

7

) are uncongested whereas ( R

4

, R

8

) are congested. Same thing will happen when TM chooses the best site depending on the highest bandwidth between two sites.

B) Traditional Model (using BW as criteria to select best replica site) vs. Enhanced Efficient Sites Model

As shown in Fig.5-b, EEST is better than TM. In TM the selected is the one which have the maximum Bandwidth between the computing site and replica’s site. The Bandwidth of the link does not reflect the real picture of the network link separating two sites. It means that the link with less value of

Bandwidth might be faster than others.

Replica Site RS, Not Replica Site, Computing Site (Uohyd node)

Congested Router, Uncongested Router

Fig.4. PRAGMA Grid and their associated network geometry, S

0

represent UoHyd [7]

File Transfer Time Difference

(MSec)

80000000

TM

(

Hop count ) EEST

70000000

60000000

50000000

40000000

0

1 2

File Transfer Time Difference

(MSec)

3

File Index

4

(a) Traditional Model (TM) using HopCount

100000000

80000000

60000000

40000000

20000000

0

1 2

TM(Bandwidth)

3

File Index

4

EEST

5

5 include packet-transmission delays (the transmission rate out of each router and out of the source host), packet-propagation delays (the propagation on each link), packet-queuing delays in intermediate routers and switches, and packet-processing delays (the processing delay at each router and at the source host) for two directions (from computing site to replica site and vice versa). Therefore, the total delays from one end to another will be the summation of all these delays [14]. Even

Time (MSec)

100000000

80000000

60000000

40000000

20000000

0

TM (SD(RTT) TM (Mean(RTT) TM (RTT) EEST

1 file transfer time.

2 3

4

File Index

5

Fig. 6. Transferring time of requested files by Job1 using TM using

(RTT) and Enhanced Efficient Set Technique EEST

35.00

30.00

25.00

20.00

15.00

10.00

5.00

0.00

40%

Apriori Pincer

50% 60% 70% 80%

(b) Traditional Model (TM) using the Bandwidth

Fig.5. Transferring time of requested files by Job1 using traditional model and Enhanced Efficient Set Technique (EEST)

C) Traditional Model (using RTT as criteria to select best replica site) vs. Enhanced Efficient Sites Technique

In this section, the traditional model selects the best replica site depending on the Round Trip Time value. RTT (Round

Trip Time) is the time taken by the small packet to travel from a client to server and back to the client. The RTT delays

Support s

Fig.7. Comparison between Total Searching Time using Apriori and

Pincer-Search algorithms

VII.

A C OMPARISON OF THE SEARCHING TIME OF SELECTION

ALGORITHMS USING A PRIORI AND PINCER SEARCH

ALGORITHMS

This section is to compare the total searching time used to get set of associated sites using both association rules techniques which are Apriori [19] and Pincer-Search

[18]. As shown in Fig. 7. To get similar sets of associated sites Apriori algorithms consumes more time than Pincer-Search algorithm the reasons are explained in Section V above. Different support values are used to accurate the results. The extra time needed by Apriori to read the database and get set of stable sites increase total

VIII.

D ISCUSSION AND CONCLUSION though RTT will reflect the real picture of network link, but still EESM is better because it probes one direction route. It looks to a single trip starting from replica site to computing site because this direction will affect on transferring the files but not the other once. We test EESM with the least RTT, the least Means of RTT and the least Standard Deviation of RTT values of RTT as shown in Fig.6. As we see the EEST is better than all RTT values criterion.

In this paper, we proposed a new replica selection model in data grid to optimize total time of executing the job by reducing total file transfer time. Our proposed model utilizes

Pincer-Search algorithm. The difference between our model and others such as the traditional model is: Our technique gets a set of sites with stable links work concurrently to transfer requested files. The traditional model selects one site as a best

replica’s site and getting a set of sites would not reflect the real network condition. i.e., most probably this model will not pay any attention whether these sites uncongested links or not at the transferring moment because the traditional model depends upon the Bandwidth alone or Hop counts alone which do not describe the real network condition, whereas we depend on the STT which reflects the real network conditions.

The difference between the current model and the most recent model that we proposed in [18] is, we enhanced the mapping function. In the previous model [18] latency of the link was the only factor used to separate the replicas having uncongested links. In the current work more affected factors such as Bandwidth, CPU load and availability of the Memory are used. These dynamic factors improve the quality of the selection replica sites. The set of replicas have, high

Bandwidth, low load of CPU, good I/O speed and stable links between the computing site and replica servers.

F UTURE WORK

Being a node of PRAGMA we are looking forward to deploy our technique as an independent service in PRAGMA

Data Grid infrastructure to speed up the execution of data grid job and minimize total cost of requested files.

R EFERENCES

[1] Chao-Tung Yang, Shih-Yu Wang, William Cheng-Chung Chu:

“Implementation of a dynamic adjustment strategy for parallel file transfer in co-allocation data grids”. The Journal of Supercomputing

54(2): 180-205 (2010).

[2] Chao-Tung Yang, Chun-Hsiang Chen, Kuan-Ching Li, Ching-Hsien

Hsu: Performance Analysis of Applying Replica Selection Technology for Data Grid Environments. PaCT 2005: 278-287.

[3]

R. Kavitha, I. Foster, “Design and evaluation of replication strategies for a high performance data grid”, in, Proceedings of Computing and

High Energy and, Nuclear Physics, 2001.

[4] http://goc.pragma-grid.net/cgibin/scmsweb/probe.cgi?source=GOC&grid=PRAGMA

[5] A. K. Pujari, “Data Mining Techniques”. Niversities Press, India, 2001.

[6]

W. Bell, et al., “OptorSim _ A grid simulator for studying dynamic data replication strategies”, Journal of High Performance Computing

Applications 17 (4) (2003).

[7] http://goc.pragma-grid.net/wiki/index.php/UoHyd

[8]

DI Lin, ZM Kedem, “Pincer-Search: A New Algorithm for

Discovering the Maximum Frequent Set”, - IEEE Transactions on

Knowledge and Data, 2002.

[9] S. Vazhkudai, S. Tuecke, I. Foster, “Replica Selection in the Globus

Data Grid”, Cluster Computing and the Grid, IEEE International

Symposium on, p. 106, 1st IEEE International Symposium on Cluster

Computing and the Grid (CCGrid'01), 2001.

[10]

A. Tirumala, J. Ferguson, “Iperf 1.2 - The TCP/UDP Bandwidth

Measurement Tool”, 2002.

[11] R. Wolski, “Dynamically forecasting network performance using the

Network Weather Service”, Cluster Computing (1998).

[12]

Yunhong Gu, Robert L. Grossman, “UDT: UDP-based data transfer for high-speed wide area networks, Computer Networks”, Volume 51,

Issue 7, 16 May 2007, Pages 1777 1799. Elsevier.

[13] I. Foster, S. Vazhkudai, and J. Schopf. “ Predicting the performance of wide-area data transfers”. In Proceedings of the 2008 ACM/IEEE conference on Supercomputing (SC 2008), pages 286-297, Nov. 2008.

[14]

J. F. Kurose and K. W. Ross. “Computer networking - A top-down approach featuring the internet”. Addison-Wesley-Longman, 2001.

[15] Iperf, http://sourceforge.net/projects/iperf.

[16] R. Agrawal, I. Tomasz, and A. Swami. “Mining association rules between sets of items in large databases”. In Proceedings of the 1993

ACM SIGMOD international conference on Management of data,

SIGMOD '93, pages 207-216, New York, NY, USA, 1993. ACM.

[17] R. M. Rahman, R. Alhajj, and K. Barker. “Replica selection strategies in data grid”. In Journal of Parallel Distributing and Computing, volume 68, pages 1561-1574, 2008.

[18]

R. M. Almuttairi, “Replica Selection Technique for Binding Cheapest

Replica Sites in Data Grids”, In Proceedings of International

Conference of Advanced Computer Science & Information Technology

(ACSIT-2012), Chennai, pages 295-312 India 15July 2012.

[19] R. M. Almuttairi, R. Wankar, A. Negi, C. R. Rao, and M. S. Almahna.,

“New replica selection technique for binding replica sites in data grids”.

In Proceedings of 1st International Conference on Energy, Power and

Control (EPC-IQ), pages 187-194, Dec. 2010.

[20] R. M. Almuttairi, R. Wankar, A. Negi, and C. R. Rao., “Intelligent replica selection strategy for data grid”. In Proceedings of the 2010

International Conference on Grid Computing and Applications

(GCA'10), pages 95-101, 2010.

[21] R. M. Almuttairi, R. Wankar, A. Negi, and C. R. Rao., “Replica selection in data grids using preconditioning of decision attributes by k-means clustering”. In Proceedings of the Vaagdevi International

Conference on Information Technology for Real World Problems, pages 18-23, Los Alamitos, CA, USA, 2010. IEEE Computer Society.

[22] R. M. Almuttairi, R. Wankar, A. Negi, and C. R. Rao., “Rough set clustering approach to replica selection in data grids”. In Proceedings of the 10th International Conference on Intelligent Systems Design and

Applications (ISDA), pages 1195-1200, Egypt, 2010.

[23]

R. M. Almuttairi, R. Wankar, A. Negi, and C. R. Rao., “Smart replica selection for data grids using rough set approximations”. In

Proceedings of the International Conference on Computational

Intelligence and Communication Networks (CICN2010), pages 466-

471, Los Alamitos, CA, USA, 2010. IEEE Computer Society.

[24] R. M. Almuttairi, R. Wankar, A. Negi, and C. R. Rao., “Enhanced data replication broker”. In Proceedings of the 5th Multi-disciplinary

International Workshop on Artificial Intelligence (MIWAI2011), pages

286-297, 2011.

[25] Praveen Ganghishetti, Rajeev Wankar, Rafah M. Almuttairi, C.

Raghavendra Rao., “Rough Set Based Quality of Service Design for

Service Provisioning in Clouds”, 6th International Conference on

Rough Sets and Knowledge Technology (RSKT 2011) pp. 268-273,

Lecture Notes in Computer Science 6954, Springer, 2011, ISBN 978-3-

642-24424-7, October 2011, Canada.

Download