Google File System Simulator
Pratima Kolan, Vinod Ramachandran

Google File System
• The master manages metadata
• Data transfer happens directly between the client and the chunk server
• Files are broken into 64 MB chunks
• Chunks are replicated across three machines for safety

Event Based Simulation
[Diagram: simulator components place events into a priority queue; the simulator repeatedly takes the next highest-priority event from the queue, simulates it, and emits the output of the simulated event.]

Simplified GFS Architecture
[Diagram: the client connects through a switch with infinite bandwidth to the master server and to five network disks; the switches represent network queues.]

Data Flow
1. The client queries the master server for the Chunk ID it wants to read.
2. The master server returns the set of disk IDs that contain the chunk.
3. The client requests the chunk from one of those disks.
4. The disk transfers the data to the client.

Experiment Setup
• We have a client whose bandwidth can be varied from 0 to 1000 Mbps
• We have 5 disks, each with a per-disk bandwidth of 40 Mbps
• We have 3 chunk replicas per chunk of data as a baseline
• Each client request is for 1 chunk of data from a disk

Simplified GFS Architecture
[Diagram: the same architecture annotated with the experiment parameters. Client bandwidth is varied from 0 to 1000 Mbps; each disk has a bandwidth of 40 Mbps; the chunk ID ranges shown on the five disks are 0-1000, 0-1000, 0-2000, 1001-2000, and 1001-2000.]

Experiment 1
• Disk requests served without load balancing
  – In this case we pick the first chunk server from the list of available chunk servers that contain the chunk.
• Disk requests served with load balancing
  – In this case we apply a greedy algorithm and balance the load of incoming requests across the 5 disks (sketched below).

Expectation
• In the non-load-balancing case we expect the effective request/data rate to peak at the bandwidth of 2 disks (80 Mbps), since only the first replica in each chunk's list is ever used.
• In the load-balancing case we expect the effective request/data rate to peak at the bandwidth of all 5 disks (200 Mbps).

Load Balancing Graph
This graph plots the data rate at the client vs. the client bandwidth.

Experiment 2
• Disk requests served with no dynamic replication
  – In this case we have a fixed number of replicas (3 in our case) and the server does not create more replicas based on read-request statistics.
• Disk requests served with dynamic replication
  – In this case the server replicates certain chunks based on the frequency of requests for each chunk.
  – We define a replication factor, which is a fraction < 1.
  – Number of replicas for a chunk = (replication factor) × (number of requests for the chunk); the rule is sketched below.
  – We cap the maximum number of replicas at the number of disks.

Expectation
• Our requests are all aimed at the chunks placed on disk 0, disk 1, and disk 2.
• In the non-replication case we expect the effective data rate at the client to be limited by the bandwidth provided by 3 disks (120 Mbps).
• In the replication case we expect the effective data rate at the client to be limited by the bandwidth provided by 5 disks (200 Mbps).

Replication Graph
This graph plots the data rate at the client vs. the client bandwidth.
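To make the event-based simulation component above concrete, here is a minimal sketch of an event loop driven by a priority queue. The Event and Simulator names, and the use of simulated time as the priority, are illustrative assumptions for this sketch, not the simulator's actual classes.

```python
import heapq
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass(order=True)
class Event:
    time: float                               # simulated time doubles as the priority
    handler: Callable = field(compare=False)  # component callback that processes the event
    payload: Any = field(default=None, compare=False)

class Simulator:
    def __init__(self):
        self.queue = []   # min-heap ordered by event time
        self.now = 0.0

    def schedule(self, event):
        heapq.heappush(self.queue, event)

    def run(self):
        # Get the next highest-priority event from the queue and let its
        # component process it; processing may schedule follow-up events
        # (the output of the simulated event).
        while self.queue:
            event = heapq.heappop(self.queue)
            self.now = event.time
            event.handler(self, event.payload)

# Tiny usage example (hypothetical handler):
# sim = Simulator()
# sim.schedule(Event(time=0.0, handler=lambda s, p: print("request issued"), payload=None))
# sim.run()
```

Each handler stands in for a simulator component; a component reacts to an event and schedules whatever follow-up events model its output.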
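The four read steps in the Data Flow slide can likewise be sketched as plain function calls. Master, Disk, chunk_to_disks, and read_chunk below are illustrative names, not the simulator's API; the 64 MB chunk size and 40 Mbps disk bandwidth are the values from the slides.

```python
class Master:
    def __init__(self, chunk_to_disks):
        self.chunk_to_disks = chunk_to_disks   # chunk ID -> list of disk IDs holding a replica

    def locate(self, chunk_id):
        # Step 2: return the set of disk IDs that contain the chunk
        return self.chunk_to_disks[chunk_id]

class Disk:
    def __init__(self, disk_id, bandwidth_mbps=40):
        self.disk_id = disk_id
        self.bandwidth_mbps = bandwidth_mbps

    def read_chunk(self, chunk_id, chunk_mb=64):
        # Step 4: transfer the chunk; returns the transfer time in seconds
        return (chunk_mb * 8) / self.bandwidth_mbps

def client_read(master, disks, chunk_id):
    replica_ids = master.locate(chunk_id)   # Steps 1-2: ask the master where the chunk lives
    disk = disks[replica_ids[0]]            # Step 3: pick one replica disk (the first one here)
    return disk.read_chunk(chunk_id)        # Step 4: the disk transfers the data
```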
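A minimal sketch of the two replica-selection policies compared in Experiment 1. The pending_requests mapping (outstanding requests per disk) is an assumed bookkeeping structure; sending each request to the least-loaded replica is one straightforward reading of "balance the load of incoming requests across the 5 disks."

```python
def pick_replica_first(replica_ids):
    # No load balancing: always take the first chunk server in the list.
    return replica_ids[0]

def pick_replica_greedy(replica_ids, pending_requests):
    # Greedy load balancing: send the request to the replica disk that
    # currently has the fewest outstanding requests.
    return min(replica_ids, key=lambda disk_id: pending_requests[disk_id])
```

Under the chunk layout above, the first-replica policy steers every request to at most two distinct disks, which is why its data rate saturates around 80 Mbps, while the greedy policy can keep all five disks busy.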
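A minimal sketch of the replica-count rule from Experiment 2. The factor-times-requests formula and the cap at the number of disks are from the slide; flooring the result at the 3-replica baseline is an assumption of this sketch.

```python
def target_replicas(request_count, replication_factor, num_disks, baseline=3):
    """Number of replicas for a chunk = replication_factor * number of requests
    for the chunk, capped at the number of disks (and, as an assumption here,
    never below the 3-replica baseline)."""
    desired = int(replication_factor * request_count)
    return max(baseline, min(desired, num_disks))
```

For example, with a replication factor of 0.01 and 600 requests for a hot chunk, the desired count of 6 is capped at 5, so the chunk ends up replicated on every disk.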
Experiment 3
• Disk requests served with no rebalancing
  – In this case we do not implement any rebalancing of read requests based on the frequency of chunk requests.
• Disk requests served with rebalancing
  – In this case we rebalance read requests by picking the request with the highest frequency and transferring it to a disk with a lower load (a sketch of this policy follows the conclusion).

Graph 3
[Request distribution graph: number of requests served by each disk (Disk 0 through Disk 4) under three configurations: no rebalancing and no replication, no rebalancing with replication, and rebalancing with no replication.]

Conclusion and Future Work
• GFS is a simple file system for large, data-intensive applications
• We studied the behavior of certain read workloads on this file system
• In the future we would like to come up with optimizations that could fine-tune GFS
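A minimal sketch of the rebalancing policy described in Experiment 3, assuming each disk keeps a queue of pending read requests and per-chunk request frequencies are tracked. The names disk_queues and request_frequency, and the omitted replica-availability check, are simplifications of this sketch.

```python
def rebalance_once(disk_queues, request_frequency):
    """Move one pending read from the most loaded disk to the least loaded one.
    disk_queues: dict mapping disk ID -> list of chunk IDs waiting to be read.
    request_frequency: dict mapping chunk ID -> how often it has been requested."""
    busiest = max(disk_queues, key=lambda d: len(disk_queues[d]))
    idlest = min(disk_queues, key=lambda d: len(disk_queues[d]))
    if len(disk_queues[busiest]) - len(disk_queues[idlest]) <= 1:
        return  # queues are already (nearly) balanced
    # Pick the pending request for the most frequently requested chunk...
    hottest = max(disk_queues[busiest], key=lambda cid: request_frequency[cid])
    # ...and transfer it to the disk with the lesser load.  A full version would
    # also check that the target disk holds a replica of the chunk.
    disk_queues[busiest].remove(hottest)
    disk_queues[idlest].append(hottest)
```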