Advanced Operating Systems Lecture 9: Distributed Systems Architecture University of Tehran Dept. of EE and Computer Engineering By: Dr. Nasser Yazdani Univ. of Tehran Distributed Operating Systems 1 Covered topic Distributed Systems Architectures References Chapter 2 of the text book Chord: Anatomy of Grid Univ. of Tehran Distributed Operating Systems 2 Outline Distributed Systems Architecture Client-server Peer to peer Computing Cloud Computing Grid computing Univ. of Tehran Distributed Operating Systems 3 Architectural Models Concerned with The placement of the components across a network of computers The interrelationships between the components Common Architectures Client – server, Web Peer to peer Cloud Grid Univ. of Tehran Distributed Operating Systems 4 Clients and Servers General interaction between a client and a server. 1.25 Univ. of Tehran Distributed Operating Systems 5 Processing Level The general organization of an Internet search engine into three different layers 1-28 Univ. of Tehran Distributed Operating Systems 6 Multitiered Architectures (1) Alternative client-server organizations (a) – (e). 1-29 Univ. of Tehran Distributed Operating Systems 7 Multitiered Architectures (2) An example of a server acting as a client. 1-30 Univ. of Tehran Distributed Operating Systems 8 Client-Server •Creating for example a hotmail? What are the options? •One server? •Several servers? Client invocation res ult invocation Server Server res ult Client Key: Process : Univ. of Tehran Distributed Operating Systems Computer: 9 Multiple Servers Service Server Client Server Client Server Univ. of Tehran Distributed Operating Systems 10 HTTP Basics (Review) HTTP layered over bidirectional byte stream Almost always TCP Interaction Client sends request to server, followed by response from server to client Requests/responses are encoded in text Stateless Server maintains no information about past client requests How to Mark End of Message? (Review) Size of message Content-Length Delimiter MIME-style Content-Type Must know size of transfer in advance Server must “escape” delimiter in content Close connection Only server can do this HTTP Request (review) Request line Method GET – return URI HEAD – return headers only of GET response POST – send data to the server (forms, etc.) URL (relative) E.g., /index.html HTTP version HTTP Request (cont.) (review) Request headers Authorization – authentication info Acceptable document types/encodings From – user email If-Modified-Since Referrer – what caused this page to be requested User-Agent – client software Blank-line Body HTTP Request (review) HTTP Request Example (review) GET / HTTP/1.1 Accept: */* Accept-Language: en-us Accept-Encoding: gzip, deflate User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0) Host: www.intel-iris.net Connection: Keep-Alive HTTP Response (review) Status-line HTTP version 3 digit response code 1XX – informational 2XX – success 3XX – redirection 404 Not Found 5XX – server error 301 Moved Permanently 303 Moved Temporarily 304 Not Modified 4XX – client error 200 OK 505 HTTP Version Not Supported Reason phrase HTTP Response (cont.) 
(review) Headers Location – for redirection Server – server software WWW-Authenticate – request for authentication Allow – list of methods supported (get, head, etc) Content-Encoding – E.g x-gzip Content-Length Content-Type Expires Last-Modified Blank-line Body HTTP Response Example (review) HTTP/1.1 200 OK Date: Tue, 27 Mar 2001 03:49:38 GMT Server: Apache/1.3.14 (Unix) (Red-Hat/Linux) mod_ssl/2.7.1 OpenSSL/0.9.5a DAV/1.0.2 PHP/4.0.1pl2 mod_perl/1.24 Last-Modified: Mon, 29 Jan 2001 17:54:18 GMT ETag: "7a11f-10ed-3a75ae4a" Accept-Ranges: bytes Content-Length: 4333 Keep-Alive: timeout=15, max=100 Connection: Keep-Alive Content-Type: text/html ….. Typical Workload (Web Pages) Multiple (typically small) objects per page File sizes Heavy-tailed Pareto distribution for tail Lognormal for body of distribution -- For reference/interest only -- Embedded references Number of embedded objects = pareto – p(x) = akax-(a+1) HTTP 0.9/1.0 (mostly review) One request/response per TCP connection Simple to implement Disadvantages Multiple connection setups three-way handshake each time Several extra round trips added to transfer Multiple slow starts Single Transfer Example Client SYN 0 RTT Client opens TCP connection 1 RTT Client sends HTTP request for HTML SYN DAT ACK 2 RTT ACK Server reads from DAT disk FIN ACK Client parses HTML Client opens TCP connection FIN ACK 3 RTT Client sends HTTP request for image 4 RTT SYN SYN ACK DAT ACK Image begins to arrive Server DAT Server reads from disk More Problems Short transfers are hard on TCP Lots of extra connections Stuck in slow start Loss recovery is poor when windows are small Increases server state/processing Server also forced to keep TIME_WAIT connection state -- Things to think about - Why must server keep these? Tends to be an order of magnitude greater than # of active connections, why? Persistent Connection Solution (review) Multiplex multiple transfers onto one TCP connection How to identify requests/responses Delimiter Server must examine response for delimiter string Content-length and delimiter Must know size of transfer in advance Block-based transmission send in multiple length delimited blocks Store-and-forward wait for entire response and then use content-length Solution use existing methods and close connection otherwise Persistent Connection Example (review) Client 0 RTT Client sends HTTP request for HTML DAT ACK DAT 1 RTT Client parses HTML Client sends HTTP request for image 2 RTT Image begins to arrive Server Server reads from disk ACK DAT ACK DAT Server reads from disk Persistent HTTP (review) Nonpersistent HTTP issues: Persistent without pipelining: Requires 2 RTTs per object Client issues new request OS must work and allocate only when previous host resources for each TCP response has been received connection One RTT for each But browsers often open referenced object parallel TCP connections to Persistent with pipelining: fetch referenced objects Default in HTTP/1.1 Persistent HTTP Client sends requests as Server leaves connection soon as it encounters a open after sending response referenced object Subsequent HTTP messages As little as one RTT for all between same client/server the referenced objects HTTP Caching Clients often cache documents Challenge: update of documents If-Modified-Since requests to check HTTP 0.9/1.0 used just date HTTP 1.1 has an opaque “entity tag” (could be a file signature, etc.) as well When/how often should the original be checked for changes? Check every time? Check each session? Day? Etc? 
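For illustration, a revalidation exchange using If-Modified-Since and the HTTP 1.1 entity tag might look like this (the host, date, and ETag below are reused from the examples above, not prescribed); a 304 answer means the cached copy can be reused without re-downloading the body:

GET /index.html HTTP/1.1
Host: www.intel-iris.net
If-Modified-Since: Mon, 29 Jan 2001 17:54:18 GMT
If-None-Match: "7a11f-10ed-3a75ae4a"

HTTP/1.1 304 Not Modified
Date: Tue, 27 Mar 2001 03:49:38 GMT
ETag: "7a11f-10ed-3a75ae4a"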
Use Expires header If no Expires, often use Last-Modified as estimate Ways to cache Client-directed caching Web Proxies Server-directed caching Content Delivery Networks (CDNs) Caching Example (1) Assumptions Average object size = 100,000 bits Avg. request rate from institution’s browser to origin servers = 15/sec Delay from institutional router to any origin server and back to router = 2 sec Consequences Utilization on LAN = 15% Utilization on access link = 100% Total delay = Internet delay + access delay + LAN delay origin servers public Internet 1.5 Mbps access link institutional network 10 Mbps LAN Caching Example (2) Possible solution Increase bandwidth of access link to, say, 10 Mbps origin servers public Internet Often a costly upgrade Consequences Utilization on LAN = 15% Utilization on access link = 15% Total delay = Internet delay + access delay + LAN delay = 2 sec + msecs + msecs 10 Mbps access link institutional network 10 Mbps LAN Caching Example (3) Install cache origin servers Suppose hit rate is .4 Consequence 40% requests will be satisfied almost immediately (say 10 msec) 60% requests satisfied by origin server Utilization of access link reduced to 60%, resulting in negligible delays Weighted average of delays = .6*2 sec + .4*10msecs < 1.3 secs public Internet 1.5 Mbps access link institutional network 10 Mbps LAN institutional cache Problems First: Over 50% of all HTTP objects are uncacheable – why? Dynamic data stock prices, scores, web cams CGI scripts results based on passed parameters Obvious fixes SSL encrypted data is not cacheable Most web clients don’t handle mixed pages well many generic objects transferred with SSL Cookies results may be based on passed data Hit metering owner wants to measure # of hits for revenue, etc. Second: How about other clients using the same data. Web Proxy Caches User configures browser: Web accesses via cache Browser sends all HTTP requests to cache Object in cache: cache returns object Else cache requests object from origin server, then returns object to client origin server Proxy server client client origin server Content Distribution Networks (CDNs) Problem: so many cleints? Content replication CDN company installs hundreds of CDN servers throughout Internet Close to users CDN replicates its customers’ content in CDN servers. When provider updates content, CDN updates servers origin server in North America CDN distribution node CDN server in S. America CDN server in Europe CDN server in Asia Networks & Server Selection Replicate content on many servers Challenges How to replicate content Where to replicate content How to find replicated content How to choose among known replicas How to direct clients towards replica Consistency of data? Server Selection Which server? Lowest load to balance load on servers Best performance to improve client performance Based on Geography? RTT? Throughput? Load? Any alive node to provide fault tolerance How to direct clients to a particular server? As part of routing anycast, cluster load balancing As part of application HTTP redirect As part of naming DNS Application Based HTTP supports simple way to indicate that Web page has moved (30X responses) Server receives Get request from client Decides which server is best suited for particular client and object Returns HTTP redirect to that server Can make informed application specific decision May introduce additional overhead multiple connection setup, name lookups, etc. 
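A sketch of this redirect-based selection (the server names below are invented for illustration): the client asks the front-end server, which answers with a 30X response naming the replica it considers best, and the client re-issues the request there:

GET /videos/clip.mpg HTTP/1.1
Host: www.service.example

HTTP/1.1 302 Found
Location: http://replica-eu-3.service.example/videos/clip.mpg

The client then opens a new connection to replica-eu-3.service.example and repeats the GET, which is exactly the extra connection setup and name lookup overhead noted above.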
OK solution in general, but… HTTP Redirect has some flaws – especially with current browsers Incurs many delays, which operators may really care Naming Based Client does DNS name lookup for service Name server chooses appropriate server address A-record returned is “best” one for the client What information can name server base decision on? Server load/location must be collected Information in the name lookup request Name service client typically the local name server for client How Akamai Works Clients fetch html document from primary server E.g. fetch index.html from cnn.com URLs for replicated content are replaced in html E.g. <img src=“http://cnn.com/af/x.gif”> replaced with <img src=“http://a73.g.akamaitech.net/7/23/cnn.com/af/x.gif”> Client is forced to resolve aXYZ.g.akamaitech.net hostname How Akamai Works How is content replicated? Akamai only replicates static content (*) Modified name contains original file name Akamai server is asked for content First checks local cache If not in cache, requests file from primary server and caches file How Akamai Works Root server gives DNS record for akamai.net Akamai.net name server returns DNS record for g.akamaitech.net Name server chosen to be in region of client’s name server TTL is large G.akamaitech.net name server chooses server in region Should try to chose server that has file in cache How to choose? Uses aXYZ name and hash TTL is small why? Simple Hashing Given document XYZ, we need to choose a server to use Suppose we use modulo n Number servers from 1…n Place document XYZ on server (XYZ mod n) What happens when a servers fails? n n-1 Same if different people have different measures of n Why might this be bad? Consistent Hash “view” = subset of all hash buckets that are visible Desired features Balanced – in any one view, load is equal across buckets Smoothness – little impact on hash bucket contents when buckets are added/removed Spread – small set of hash buckets that may hold an object regardless of views Load – across all views # of objects assigned to hash bucket is small Consistent Hash – Example • Construction • Assign each of C hash buckets to random points on mod 2n circle, where, hash key size = n. • Map object to random position on circle • Hash of object = closest clockwise bucket 0 14 12 Bucket 4 8 Smoothness addition of bucket does not cause movement between existing buckets Spread & Load small set of buckets that lie near object Balance no bucket is responsible for large number of objects How Akamai Works cnn.com (content provider) DNS root server Akamai server Get foo.jpg 12 Get index. html 1 11 2 5 3 6 7 4 8 9 End-user 10 Get /cnn.com/foo.jpg Akamai high-level DNS server Akamai low-level DNS server Nearby matching Akamai server Akamai – Subsequent Requests cnn.com (content provider) Get index. html 1 DNS root server Akamai high-level DNS server 2 7 8 9 End-user Akamai server 10 Get /cnn.com/foo.jpg Akamai low-level DNS server Nearby matching Akamai server Impact on DNS Usage DNS is used for server selection more and more What are reasonable DNS TTLs for this type of use Typically want to adapt to load changes Low TTL for A-records what about NS records? How does this affect caching? What do the first and subsequent lookup do? 
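The consistent-hash construction sketched a few slides back (buckets placed on a mod 2^n circle, each object assigned to the closest clockwise bucket) is small enough to show in code. A minimal illustrative sketch in Python, assuming SHA-1 as the hash and ignoring virtual nodes and replication:

import bisect, hashlib

def point(key, bits=32):
    # Map a string to a point on the mod 2^bits circle.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << bits)

class ConsistentHash:
    def __init__(self, buckets):
        # Each bucket (e.g., a cache server) sits at a pseudo-random point on the circle.
        self.ring = sorted((point("bucket:" + b), b) for b in buckets)

    def lookup(self, obj):
        # An object is served by the first bucket clockwise from its own point.
        keys = [p for p, _ in self.ring]
        i = bisect.bisect_right(keys, point("object:" + obj)) % len(self.ring)
        return self.ring[i][1]

    def add(self, bucket):
        # Adding a bucket only takes over the objects between it and its
        # counter-clockwise neighbour (the smoothness property above).
        bisect.insort(self.ring, (point("bucket:" + bucket), bucket))

servers = ConsistentHash(["cache-a", "cache-b", "cache-c"])
print(servers.lookup("cnn.com/af/x.gif"))

Contrast with the simple mod-n scheme: there, removing one of n servers changes (XYZ mod n) for almost every document, while here only the objects assigned to the failed bucket move, to the next bucket clockwise.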
HTTP (Summary) Simple text-based file exchange protocol Workloads Typical documents structure, popularity Server workload Interactions with TCP Support for status/error responses, authentication, client-side state maintenance, cache maintenance Connection setup, reliability, state maintenance Persistent connections How to improve performance Persistent connections Caching Replication Why Study Peer to peer? To understand how they work To build your own peer to peer system To understand the techniques and principles within them To modify, adapt, reuse these techniques and principles in other related areas Cloud computing Sensor networks Why Peer to Peer? to share and exchange resources they have books, class notes, experiences, videos, music cd’s Why not Web: is heavy weight for specific resources: General framework Somebody want to share his/her movie, book, music. Others: want to watch that great Movie, etc. -They can download and watch, but needs 1.Search: “better off dead” -> better_off_dead.mov or -> 0x539fba83ajdeadbeef 2.Locate sources of better_off_dead.mov 3.Download the file from them 50 Searching Need search. N1 Key=“title” Value=MP3 data… Publisher N2 Internet N4 N5 N3 ? Client Lookup(“title”) N6 51 Search Approaches Centralized Flooding A hybrid: Flooding between “Supernodes” Structured 52 Primitives & Structure Common Primitives: Join: how to I begin participating? Publish: how do I advertise my file? Search: how to I find a file? Fetch: how to I retrieve a file? Centralized Database: Join: on startup, client contacts central server Publish: reports list of files to central server Search: query the server => return node(s) that store the requested file 53 Napster Example: Publish insert(X, 123.2.21.23) ... Publish I have X, Y, and Z! 123.2.21.23 54 Napster: Search 123.2.0.18 Fetch Query search(A) --> 123.2.0.18 Reply Where is file A? 55 Napster: Discussion Pros: Simple Search scope is O(1) for even complex searches (one index, etc.) Controllable (pro or con?) Cons: Server maintains O(N) State Server does all processing Single point of failure Technical failures + legal (napster shut down 2001) 56 Query Flooding Join: Must join a flooding network Usually, establish peering with a few existing nodes Publish: no need, just reply Search: ask neighbors, who ask their neighbors, and so on... when/if found, reply to sender. TTL limits propagation 57 Example: Gnutella I have file A. I have file A. Reply Query Where is file A? 58 Flooding: Discussion Pros: Cons: Fully de-centralized Search cost distributed Processing @ each node permits powerful search semantics Search scope is O(N) Search time is O(???) Nodes leave often, network unstable TTL-limited search works well for haystacks. For scalability, does NOT search every node. May have to re-issue query later 59 Supernode Flooding Why everybody should participate in search? A subset of nodes for search, supernode, like multicast. Kazal Technology Mechanism: Join: on startup, client contacts a “supernode” ... may at some point become one itself Publish: send list of files to supernode Search: send query to supernode, supernodes flood query amongst themselves. Supernode network just like prior flooding net 60 Supernode Network Design “Super Nodes” 61 Supernode: File Insert insert(X, 123.2.21.23) ... Publish I have X! 123.2.21.23 62 Supernode: File Search search(A) --> 123.2.22.50 123.2.22.50 Query Replies search(A) --> 123.2.0.18 Where is file A? 123.2.0.18 63 Supernode: Which nodes? 
Often, bias towards nodes with good: bandwidth, computational resources, availability! 64 Stability and Superpeers Why superpeers? Query consolidation and a caching effect. Many connected nodes may have only a few files, and propagating a query to such a sub-node would take more bandwidth than answering it yourself. Requires network stability. Superpeer selection is time-based: how long you have been on is a good predictor of how long you will be around. 65 Superpeer results Basically "just better" than flooding to everyone; gets an order of magnitude or two better scaling. But still fundamentally O(search) * O(per-node storage) = O(N): central is O(1) search with O(N) storage, flooding is O(N) search with O(1) storage, and superpeers can trade between the two. 66 Structured Search: Distributed Hash Tables The academic answer to p2p. Goals: guaranteed lookup success, provable bounds on search time, provable scalability. Makes some things harder: fuzzy queries / full-text search / etc., and read-write rather than read-only data. A hot topic in networking since their introduction in ~2000/2001. 67 Searching Wrap-Up Type / O(search) / Storage / Fuzzy? Central: O(1) / O(N) / yes. Flood: ~O(N) / O(1) / yes. Super: < O(N) / > O(1) / yes. Structured: O(log N) / O(log N) / not really. 68 DHT: Overview Abstraction: a distributed "hash-table" (DHT) data structure: put(id, item); item = get(id). Implementation: the nodes in the system form a distributed data structure, which can be a ring, tree, hypercube, skip list, butterfly network, ... 69 DHT: Overview (2) Structured overlay routing: Join: on startup, contact a "bootstrap" node and integrate yourself into the distributed data structure; get a node id. Publish: route the publication for a file id toward a close node id along the data structure. Search: route a query for the file id toward a close node id; the data structure guarantees that the query will meet the publication. Important difference: get(key) is an exact match on the key! search("spars") will not find file("briney spars"). We can exploit this to be more efficient. 70 DHT: Example - Chord Associate to each node and each file a unique id in a one-dimensional space (a ring), e.g., picked from the range [0...2^m]; usually the hash of the file or of the node's IP address. Properties: routing table size is O(log N), where N is the total number of nodes, and a file is guaranteed to be found in O(log N) hops. From MIT, 2001. 71 DHT: Consistent Hashing [Ring diagram: circular ID space with nodes N32, N90, N105 and keys K5, K20, K80; K80 is stored at N90] A key is stored at its successor: the node with the next higher ID. 72 DHT: Chord Basic Lookup [Ring diagram: N32 asks "Where is key 80?"; the query is passed around the ring through N60 until N90 answers "N90 has K80"] 73 DHT: Chord "Finger Table" [Diagram: the fingers of node N80 point 1/2, 1/4, 1/8, ..., 1/128 of the way around the ring] Entry i in the finger table of node n is the first node that succeeds or equals n + 2^i. In other words, the ith finger points 1/2^(m-i) of the way around the ring. 74 Node Join Compute your ID. Use an existing node to route to that ID in the ring; this finds s = successor(id). Ask s for its predecessor p, then splice yourself in just like a linked list: p->successor = me; me->successor = s; me->predecessor = p; s->predecessor = me. 75 DHT: Chord Join Assume an identifier space [0..8]. Node n1 joins. [Ring diagram: ids 0-7; n1's successor table has entries for 1+1, 1+2, 1+4, all pointing back to n1 while it is the only node] 76 DHT: Chord Join Node n2 joins. [Ring diagram: n1's and n2's successor tables are updated to point to each other where appropriate] 77 DHT: Chord Join Nodes n0 and n6 join. [Ring diagram: all four nodes n0, n1, n2, n6 with their updated successor tables] 78 DHT: Chord Join Nodes n1, n2, n0 and n6 are in the ring; items f7 and f2 are published. [Ring diagram: each item is stored at its successor node, shown next to that node's successor table] 79 DHT: Chord Routing Upon receiving a query for item id, a node checks whether it stores the item locally; if not, it forwards the query to the largest node in its successor table that does not exceed id. [Ring diagram: query(7) is forwarded along successor-table entries until it reaches the node storing item 7] 80 DHT: Chord Summary Routing table size? Log N fingers. Routing time? Each hop is expected to halve the distance to the desired id, so expect O(log N) hops. 81
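To make the join and lookup rules above concrete, here is a minimal illustrative sketch (Python; single process, no joins/leaves or stabilization, and the helper names are ours rather than from the Chord paper) of finger-table construction and successor-based lookup:

M = 6                         # identifier space [0 .. 2^M)
RING = 1 << M

def in_interval(x, a, b):
    # True if x lies in the half-open ring interval (a, b], wrapping around 2^M.
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.fingers = []     # fingers[k] = successor(id + 2^k); fingers[0] is the successor

def successor(ids, x):
    # First live node id clockwise from point x; keys live at their successor.
    for i in sorted(ids):
        if i >= x:
            return i
    return min(ids)           # wrap around the ring

def build(ids):
    nodes = {i: Node(i) for i in ids}
    for n in nodes.values():
        n.fingers = [successor(ids, (n.id + (1 << k)) % RING) for k in range(M)]
    return nodes

def lookup(nodes, start, key, hops=0):
    n = nodes[start]
    if in_interval(key, n.id, n.fingers[0]):
        return n.fingers[0], hops + 1           # the immediate successor stores the key
    # Otherwise forward to the closest preceding finger (largest finger not past the key).
    for f in reversed(n.fingers):
        if f != n.id and in_interval(f, n.id, key):
            return lookup(nodes, f, key, hops + 1)
    return lookup(nodes, n.fingers[0], key, hops + 1)

nodes = build([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
print(lookup(nodes, start=1, key=54))           # -> (56, 4): key 54 lives at node 56

Each forwarding step at least halves the remaining ring distance to the key, which is where the O(log N) hop bound on the summary slide comes from.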
DHT: Discussion Pros: guaranteed lookup, with O(log N) per-node state and search scope. Cons: this line used to say "not used", but DHTs are now being used in a few apps, including BitTorrent; and supporting non-exact-match search is (quite!) hard. 82 The limits of search: A Peer-to-peer Google? Complex intersection queries ("the" + "who") have billions of hits for each term alone. Sophisticated ranking must compare many results before returning a subset to the user. Very, very hard for a DHT / p2p system: it needs high inter-node bandwidth. (This is exactly what Google does, with massive clusters.) But maybe many file-sharing queries are okay... 83 Fetching Data Once we know which node(s) have the data we want... Option 1: fetch from a single peer. Problem: you have to fetch from a peer who has the whole file, so peers are not useful sources until they have downloaded the whole file, at which point they probably log off. :) How can we fix this? 84 Chunk Fetching More than one node may have the file. How to tell? We must be able to distinguish identical files: not necessarily the same filename, and the same filename is not necessarily the same file... Use a hash of the file (common: MD5, SHA-1, etc.). How to fetch? Get bytes [0..8000] from A, [8001..16000] from B. Alternative: erasure codes. 85 BitTorrent: Overview Swarming. Join: contact a centralized "tracker" server, get a list of peers. Publish: run a tracker server. Search: out-of-band, e.g., use Google to find a tracker for the file you want. Fetch: download chunks of the file from your peers; upload chunks you have to them. Big differences from Napster: chunk-based downloading (sound familiar?), a "few large files" focus, and anti-freeloading mechanisms. 86 BitTorrent Periodically get a list of peers from the tracker. More often: ask each peer which chunks it has (or have them update you), and request chunks from several peers at a time; peers will start downloading from you as well. BT has some machinery to try to bias towards helping those who help you. (A minimal fetch-and-verify sketch follows below.) 87
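To make the chunk-based, multi-peer fetch described above concrete, here is a minimal illustrative sketch (Python). The chunk size, the get_chunk transport callback, and the peer list are assumptions for the example, not part of any real client protocol; the point is simply that each chunk is verified against a known hash before being accepted, so chunks can safely come from different, untrusted peers:

import hashlib

CHUNK = 8000   # bytes per chunk, echoing the "bytes [0..8000] from A" example above

def fetch_file(chunk_hashes, peers, get_chunk):
    # chunk_hashes: expected SHA-1 hex digest of each chunk, published with the file.
    # get_chunk(peer, index) -> bytes, an assumed transport call to one peer.
    chunks = [None] * len(chunk_hashes)
    for idx, expected in enumerate(chunk_hashes):
        for attempt in range(len(peers)):
            peer = peers[(idx + attempt) % len(peers)]   # spread requests across peers
            data = get_chunk(peer, idx)
            if hashlib.sha1(data).hexdigest() == expected:
                chunks[idx] = data                       # verified, keep it
                break                                    # else: bad chunk, try another peer
    if any(c is None for c in chunks):
        raise IOError("could not fetch and verify every chunk")
    return b"".join(chunks)

# Tiny self-contained check with a fake in-memory "peer":
blob = b"x" * 20000
hashes = [hashlib.sha1(blob[i:i + CHUNK]).hexdigest() for i in range(0, len(blob), CHUNK)]
serve = lambda peer, i: blob[i * CHUNK:(i + 1) * CHUNK]
assert fetch_file(hashes, ["A", "B"], serve) == blob

A real client layers more on top of this (tracking which peer has which chunks, and the tit-for-tat bias mentioned above), but the verify-before-accept step is what makes downloading from many strangers safe.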
BitTorrent: Publish/Join [Diagram: a new peer contacts the tracker and joins the swarm of peers for the file] 88 BitTorrent: Fetch [Diagram: the peer exchanges chunks with several other peers in the swarm at once] 89 BitTorrent: Summary Pros: works reasonably well in practice; gives peers an incentive to share resources and avoids freeloaders. Cons: a central tracker server is needed to bootstrap the swarm. (The tracker is a design choice, not a requirement, as you know from your projects; modern BitTorrent can also use a DHT to locate peers, but the approach still needs a "search" mechanism.) 90 Writable, persistent p2p Do you trust your data to 100,000 monkeys? Node availability hurts. Ex: store 5 copies of the data on different nodes; when someone goes away, you must re-replicate the data they held. Hard drives are *huge*, but cable-modem upload bandwidth is tiny, perhaps 10 GBytes/day, so it takes many days to upload the contents of a 200 GB hard drive. A very expensive leave/replication situation! 91 What's out there? Central / Flood / Supernode flood / Route: Whole file: Napster / Gnutella / - / Freenet. Chunk based: BitTorrent / - / KaZaA (bytes, not chunks) / DHTs, eDonkey2000. 92 P2P: Summary Many different styles; remember the pros and cons of each: centralized, flooding, swarming, unstructured and structured routing. Lessons learned: single points of failure are bad; flooding messages to everyone is bad; the underlying network topology is important; not all nodes are equal; we need incentives to discourage freeloading; privacy and security are important; structure can provide theoretical bounds and guarantees. 93 Cloud Computing Infrastructure Take a seat & prepare to fly Anh M. Nguyen CS525, UIUC, Spring 2009 94 What is cloud computing? "I don't understand what we would do differently in the light of Cloud Computing other than change the wordings of some of our ads" - Larry Ellison, Oracle's CEO. "I have not heard two people say the same thing about it [cloud]. There are multiple definitions out there of 'the cloud'" - Andy Isherwood, HP's Vice President of European Software Sales. "It's stupidity. It's worse than stupidity: it's a marketing hype campaign." - Richard Stallman, Free Software Foundation founder. 95 What is a Cloud? It's a cluster! It's a supercomputer! It's a datastore! It's superman! None of the above. All of the above. Cloud = lots of storage + compute cycles nearby. 96 What is a Cloud? A single-site cloud (aka "datacenter") consists of: compute nodes (split into racks); switches connecting the racks; a network topology, e.g., hierarchical; storage (backend) nodes connected to the network; a front-end for submitting jobs; services: physical resource set, software services. A geographically distributed cloud consists of multiple such sites, each site perhaps with a different structure and services. 97 A Sample Cloud Topology [Diagram: core switch, top-of-the-rack switches, racks of servers] 98 Scale of Industry Datacenters Microsoft [NYTimes, 2008]: 150,000 machines; growth rate of 10,000 per month; largest datacenter: 48,000 machines; 80,000 total running Bing. Yahoo! [Hadoop Summit, 2009]: 25,000 machines, split into clusters of 4000. AWS EC2 (Oct 2009): 40,000 machines, 8 cores/machine. Google (rumored): several hundreds of thousands of machines. 99 "A Cloudy History of Time" [Timeline, 1940-2010: the first datacenters; timesharing companies & the data processing industry; clusters; grids; PCs (not distributed!); peer-to-peer systems; clouds and datacenters] 100 "A Cloudy History of Time" First large datacenters: ENIAC, ORDVAC, ILLIAC; many used vacuum tubes and mechanical relays. Data Processing Industry: 1968: $70 M; 1978: $3.15 Billion. Timesharing Industry (1975): market share Honeywell 34%, IBM 15%, Xerox 10%, CDC 10%, DEC 10%, UNIVAC 10%; machines: Honeywell 6000 & 635, IBM 370/168, Xerox 940 & Sigma 9, DEC PDP-10, UNIVAC 1108. Berkeley NOW Project, supercomputers, server farms (e.g., Oceano). P2P systems (90s-00s): many millions of users, many GB per day. Grids (1980s-2000s): GriPhyN, Open Science Grid and Lambda Rail (2000s). Clouds and datacenters. 101 Trends: Technology Doubling Periods – storage: 12 mos, bandwidth: 9 mos, and (what law is this?)
cpu speed: 18 mos Then and Now Bandwidth 1985: mostly 56Kbps links nationwide 2004: 155 Mbps links widespread Disk capacity Today’s PCs have 100GBs, same as a 1990 supercomputer 102 Trends: Users Then and Now Biologists: 1990: were running small single-molecule simulations 2004: want to calculate structures of complex macromolecules, want to screen thousands of drug candidates, sequence very complex genomes Physicists 2008 onwards: CERN’s Large Hadron Collider will produce 700 MB/s or 15 PB/year Trends in Technology and User Requirements: Independent or Symbiotic? 103 Prophecies In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”. Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application [Have today’s clouds brought us closer to this reality?] 104 So, clouds have been around for decades! But aside from massive scale what’s new about today’s cloud computing?! 105 What(’s new) in Today’s Clouds? Three major features: On-demand access: Pay-as-you-go, no upfront commitment. I. Anyone can access it (e.g., Washington Post – Hillary Clinton example) Data-intensive Nature: What was MBs has now become TBs. II. Daily logs, forensics, Web data, etc. Do you know the size of Wikipedia dump? New Cloud Programming Paradigms: MapReduce/Hadoop, Pig Latin, DryadLinq, Swift, and many others. III. High in accessibility and ease of programmability Combination of one or more of these gives rise to novel and unsolved distributed computing problems in cloud computing. 106 I. On-demand access: *aaS Classification On-demand: renting a cab vs (previously) renting a car, or buying one. E.g.: HaaS: Hardware as a Service You get access to flexible computing and storage infrastructure. Virtualization is one way of achieving this. Often said to subsume HaaS. Ex: Amazon Web Services (AWS: EC2 and S3), Eucalyptus, Rightscale. PaaS: Platform as a Service You get access to barebones hardware machines, do whatever you want with them Ex: Your own cluster, Emulab IaaS: Infrastructure as a Service AWS Elastic Compute Cloud (EC2): $0.086-$1.16 per CPU hour AWS Simple Storage Service (S3): $0.055-$0.15 per GB-month You get access to flexible computing and storage infrastructure, coupled with a software platform (often tightly) Ex: Google’s AppEngine SaaS: Software as a Service You get access to software services, when you need them. Often said to subsume SOA (Service Oriented Architectures). Ex: Microsoft’s LiveMesh, MS Office on demand 107 II. Data-intensive Computing Computation-Intensive Computing Data-Intensive Example areas: MPI-based, High-performance computing, Grids Typically run on supercomputers (e.g., NCSA Blue Waters) Typically store data at datacenters Use compute nodes nearby Compute nodes run computation services In data-intensive computing, the focus shifts from computation to the data: CPU utilization no longer the most important resource metric Problem areas include Distributed systems Middleware OS Storage Networking Security Others 108 III. New Cloud Programming Paradigms Dataflow programming frameworks Google: MapReduce and Sawzall Yahoo: Hadoop and Pig Latin Microsoft: DryadLINQ Facebook: Hive Amazon: Elastic MapReduce service (pay-as-you-go) Google (MapReduce) Indexing: a chain of 24 MapReduce jobs ~200K jobs processing 50PB/month (in 2006) Yahoo! 
(Hadoop + Pig) WebMap: a chain of 100 MapReduce jobs 280 TB of data, 2500 nodes, 73 hours Facebook (Hadoop + Hive) ~300TB total, adding 2TB/day (in 2008) 3K jobs processing 55TB/day Similar numbers from other companies, e.g., Yieldex, 109 Two Categories of Clouds Industrial Clouds Can be either a (i) public cloud, or (ii) private cloud Private clouds are accessible only to company employees Public clouds provide service to any paying customer: Amazon S3 (Simple Storage Service): store arbitrary datasets ,pay per GBonth stored Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary images, pay per CPU hour used Google AppEngine: develop applications within their appengine framework, upload data that will be imported into their format, and run Academic Clouds Allow researchers to innovate, deploy, and experiment Google-IBM Cloud (U. Washington): run apps programmed atop Hadoop Cloud Computing Testbed (CCT @ UIUC): first cloud testbed to support systems research. Runs: (i) apps programmed atop Hadoop and Pig, (ii) systems-level research on this first generation of cloud computing models (~HaaS), and (iii) Eucalyptus services (~AWS EC2). http://cloud.cs.illinois.edu OpenCirrus: first federated cloud testbed. http://opencirrus.org 110 Academic Clouds CCT = Cloud Computing Testbed NSF infrastructure Used by 10+ NSF projects, including several nonUIUC projects Housed within Siebel Center (4th floor!) Accessible to students of CS525! Almost half of SP09 course used CCT for their projects OpenCirrus = Federated Cloud Testbed Contains CCT and other sites If you need a CCT account for your CS525 experiment, let me know asap! There are a limited number of these available for CS525 111 Cloud Computing Testbed (CCT) 112 CCT Hardware in more Detail •128 compute nodes = 64+ •500 TB & 1000+ shared co 113 Goal of CCT Support both Systems Research and Applications Research in Data-intensive Distributed Computing 114 Open Cirrus Federation First open federated cloud testbed Shared: research, applications, infrastructure (9*1,000 cores), data sets Global services: sign on, monitoring, store, etc., Federated clouds, meaning each is different RAS Intel HP KIT (de) ETRI Yahoo UIUC CMU IDA (sg) MIMOS 115 21 March 2016 Grown to 9 sites, with more to come 115 10 Challenges [Above the Clouds] (Index: Performance Data-related Scalability Logisitical) Availability of Service: Use Multiple Cloud Providers; Use Elasticity; Prevent DDOS Data Lock-In: Enable Surge Computing; Standardize APIs Data Confidentiality and Auditability: Deploy Encryption, VLANs, Firewalls: Geographical Data Storage Data Transfer Bottlenecks: Data Backup/Archival; Higher BW Switches; New Cloud Topologies; FedExing Disks Performance Unpredictability: QoS; Improved VM Support; Flash Memory; Schedule VMs Scalable Storage: Invent Scalable Store Bugs in Large Distributed Systems: Invent Debuggers; Real-time debugging; predictable pre-run-time debugging Scaling Quickly: Invent Good Auto-Scalers; Snapshots for Conservation Reputation Fate Sharing Software Licensing: Pay-for-use licenses; Bulk use sales 116 A more Bottom-Up View of Open Research Directions Myriad interesting problems that acknowledge the characteristics that make today’s cloud computing unique: massive scale + on-demand + dataintensive + new programmability + and infrastructure- and applicationspecific details. 
Monitoring: of systems&applications; single site and multi-site Storage: massive scale; global storage; for specific apps or classes Failures: what is their effect, what is their frequency, how do we achieve fault-tolerance? Scheduling: Moving tasks to data, dealing with federation Communication bottleneck: within applications, within a site Locality: within clouds, or across them Cloud Topologies: non-hierarchical, other hierarchical Security: of data, of users, of applications, confidentiality, integrity Availability of Data Seamless Scalability: of applications, of clouds, of data, of everything Inter-cloud/multi-cloud computations Second Generation of Other Programming Models? Beyond MapReduce! Pricing Models 117 Explore the limits today’s of cloud computing New Parallel Programming Paradigms: MapReduce Highly-Parallel Data-Processing Originally designed by Google (OSDI 2004 paper) Open-source version called Hadoop, by Yahoo! Hadoop written in Java. Your implementation could be in Java, or any executable Google (MapReduce) Indexing: a chain of 24 MapReduce jobs ~200K jobs processing 50PB/month (in 2006) Yahoo! (Hadoop + Pig) WebMap: a chain of 100 MapReduce jobs 280 TB of data, 2500 nodes, 73 hours Annual Hadoop Summit: 2008 had 300 attendees, 2009 had 700 attendees 118 What is MapReduce? Terms are borrowed from Functional Language (e.g., Lisp) Sum of squares: (map square ‘(1 2 3 4)) Output: (1 4 9 16) [processes each record sequentially and independently] (reduce + ‘(1 4 9 16)) (+ 16 (+ 9 (+ 4 1) ) ) Output: 30 [processes set of all records in a batch] 119 Map Process individual key/value pair to generate intermediate key/value pairs. Welcome Everyone Hello Everyone Input <filename, file text> Welcome Everyone Hello Everyone 1 1 1 1 120 Reduce Processes and merges all intermediate values associated with each given key assigned to it Welcome Everyone Hello Everyone 1 1 1 1 Everyone 2 Hello 1 Welcome 1 121 Some Applications Distributed Grep: Map - Emits a line if it matches the supplied pattern Reduce - Copies the intermediate data to output Count of URL access frequency Map – Process web log and outputs <URL, 1> Reduce - Emits <URL, total count> Reverse Web-Link Graph Map – process web log and outputs <target, source> Reduce - emits <target, list(source)> 122 Programming MapReduce Externally: For user 1. 2. 3. Write a Map program (short), write a Reduce program (short) Submit job; wait for result Need to know nothing about parallel/distributed programming! Internally: For the cloud (and for us distributed systems researchers) 1. 2. 3. 4. Parallelize Map Transfer data from Map to Reduce Parallelize Reduce Implement Storage for Map input, Map output, Reduce input, and Reduce output 123 Inside MapReduce For the cloud (and for us distributed systems researchers) Parallelize Map: easy! each map job is independent of the other! 1. All Map output records with same key assigned to same Reduce 2. Transfer data from Map to Reduce: 3. 4. All Map output records with same key assigned to same Reduce task use partitioning function (more soon) Parallelize Reduce: easy! each reduce job is independent of the other! Implement Storage for Map input, Map output, Reduce input, and Reduce output Map input: from distributed file system Map output: to local disk (at Map node); uses local file system Reduce input: from (multiple) remote disks; uses local file systems Reduce output: to distributed file system local file system = Linux FS, etc. 
distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System) 124 Internal Workings of MapReduce 125 Fault Tolerance Worker Failure Master keeps 3 states for each worker task Master sends periodic pings to each worker to keep track of it (central failure detector) (idle, in-progress, completed) If fail while in-progress, mark the task as idle If map workers fail after completed, mark worker as idle Reduce task does not start until all Map tasks done, and all its (Reduce’s) data has been fetched Master Failure Checkpoint 126 Locality and Backup tasks Locality Since cloud has hierarchical topology GFS stores 3 replicas of each of 64MB chunks Maybe on different racks Attempt to schedule a map task on a machine that contains a replica of corresponding input data: why? Stragglers (slow nodes) Due to Bad Disk, Network Bandwidth, CPU, or Memory. Perform backup (replicated) execution of straggler task: task done when first replica complete 127 Testbed: 1800 servers each with 4GB RAM, dual 2GHz Xeon, dual 169 GB IDE disk, 100 Gbps, Gigabit ethernet per machine Grep Locality optimization helps: 1800 machines read 1 TB at peak ~31 GB/s W/out this, rack switches would limit to 10 GB/s Startup overhead is significant for short jobs Workload: 1010 100-byte records to extract records matching a rare pattern (92K matching records) 128 Discussion Points Hadoop always either outputs complete results, or none Storage: Is the local write-remote read model good for Map output/Reduce input? Partial results? Can you characterize partial results of a partial MapReduce run? What happens on node failure? Can you treat intermediate data separately, as a firstclass citizen? Entire Reduce phase needs to wait for all Map tasks to finish: in other words, a barrier Why? What is the advantage? What is the disadvantage? Can you get around this? 129 Grid 1. 2. 3. 4. 5. What is Grid? Grid Projects & Applications Grid Technologies Globus CompGrid Definition A type of parallel and distributed system that enables the sharing, selection, & aggregation of geographically distributed resources: Computers – PCs, workstations, clusters, supercomputers, laptops, notebooks, mobile devices, PDA, etc; Software – e.g., ASPs renting expensive special purpose applications on demand; Catalogued data and databases – e.g. transparent access to human genome database; Special devices/instruments – e.g., radio telescope – SETI@Home searching for life in galaxy. People/collaborators. depending on their availability, capability, cost, and user QoS requirements for solving large-scale problems/applications. thus enabling the creation of “virtual organization” (VOs) Resources = assets, capabilities, and knowledge Capabilities (e.g. application codes, analysis tools) Compute Grids (PC cycles, commodity clusters, HPC) Data Grids Experimental Instruments Knowledge Services Virtual Organisations Utility Services Why go Grid? Hot subject Try it, experience it to learn the potential Will enable true ubiquitous computing in future Today, proven in some areas: intraGrids, But still long way to World Wide Grid State of art techniques, tools are difficult Short term goals? Use another technology Does your system have Grid characteristics? Distributed users, large scale and heterogeneous resources, across domains Grid‘s main idea To treat CPU cycles and software like commodities. Enable the coordinated use of geographically distributed resources – in the absence of central control and existing trust relationships. 
Computing power is produced much like utilities such as power and water are produced for consumers. Users will have access to “power” on demand “When the Network is as fast as the computer’s internal links, the machine disintegrates across the Net into a set of special purpose appliances” – Gilder Technology Report June 2000 Computational Grids and Electric Power Grids What do users want ? Grid Consumers Execute jobs for solving varying problem size and complexity Benefit by selecting and aggregating resources wisely Tradeoff timeframe and cost Grid Providers Contribute (“idle”) resource for executing consumer jobs Benefit by maximizing resource utilisation Tradeoff local requirements & market opportunity Grid Applications Distributed HPC (Supercomputing): Computational science. High-Capacity/Throughput Computing: Large scale simulation/chip design & parameter studies. Content Sharing (free or paid) Sharing digital contents among peers (e.g., Napster) Remote software access/renting services: Application service provides (ASPs) & Web services. Data-intensive computing: Drug Design, Particle Physics, Stock Prediction... On-demand, real-time computing: Medical instrumentation & Mission Critical. Collaborative Computing: Collaborative design, Data exploration, education. Service Oriented Computing (SOC): Towards economic-based Utility Computing: New paradigm, new applications, new industries, and new business. Grid Projects Australia Nimrod-G Gridbus GridSim Virtual Lab DISCWorld GrangeNet ..new coming up Europe UNICORE Cactus UK eScience EU Data Grid EuroGrid MetaMPI XtremeWeb and many more. India I-Grid Japan Ninf DataFarm Korea... N*Grid USA Globus Legion OGSA Sun Grid Engine AppLeS NASA IPG Condor-G Jxta NetSolve AccessGrid and many more... Cycle Stealing & .com Initiatives Distributed.net SETI@Home, …. Entropia, UD, Parabon,…. Public Forums Global Grid Forum Australian Grid Forum IEEE TFCC CCGrid conference P2P conference Grid Requirements Identity & authentication Authorization & policy Resource discovery Resource characterization Resource allocation (Co-)reservation, workflow Distributed algorithms Remote data access High-speed data transfer Performance guarantees Monitoring Adaptation Intrusion detection Resource management Accounting & payment Fault management System evolution Etc. 
Problem Enabling secure, controlled remote access to computational resources and management of remote computation – Authentication and authorization – Resource discovery & characterization – Reservation and allocation – Computation monitoring and control Challenges Locate “suitable” computers Authenticate with appropriate sites Allocate resources on those computers Initiate computation on those computers Configure those computations Select “appropriate” communication methods Compute with “suitable” algorithms Access data files, return output Respond “appropriately” to resource changes Leading Grid Middleware Developments Globus Toolkit (mainly developed at ANL and USC) Service-oriented toolkit from the Globus project,to be used in Grid applications, not targeted at end-user Services for resource selection and allocation, authentication, file system access and file transfer, … Largest user-base in projects worldwide Open-source software, commercial support by IBM and Platform Computing The Globus Alliance Globus Project ™, since 1996 Ian Foster (Argonne National Lab), Carl Kesselman (University of Southern California’s Information Science Institute) Develop protocols, middleware and tools for Grid computing Globus Alliance, since Sept 2003 International scope University of Edinburgh’s EPCC Swedish Center for Parallel Computers (PDC) Advisory council of Academic Affiliates from AsiaPacific, Europe, US Globus Toolkit GT2 (2.4 released in 2002): reference implementation of Grid fabric protocols GT3 (3.0 released July 2003): redesign GRAM for job submissions MDS for resource discovery GridFTP for data transfer GSI security OGSI based Grid services, built on SOAP and XML GT3.2 released March 31, 2004 Globus Toolkit Services Job submission and management (GRAM) Security (GSI) LDAP-based Information Service Remote file management (GASS) and transfer (GridFTP) PKI-based Security (Authentication) Service Information services (MDS) Uniform Job Submission Remote Storage Access Service Remote Data Catalogue and Management Tools Support by Globus 2.0 released in 2002 Resource selection and allocation (GIIS, GRIS) Resource Specification Language Common notation for exchange of information between components RSL provides two types of information: Syntax similar to MDS/LDAP filters Resource requirements: Machine type, number of nodes, memory, etc. Job configuration: Directory, executable, args, environment API provided for manipulating RSL Protocols Make the Grid Protocols and APIs Protocols enable interoperability APIs enable portability Sharing is about interoperability, so … Grid architecture should be about protocols Grid Services Architecture: Previous Perspective … a rich variety of applications ... Applns Appln Toolkits Remote data toolkit Remote comp. toolkit Remote viz toolkit Async. collab. toolkit ... Remote sensors toolkit Grid Services Protocols, authentication, policy, resource management, instrumentation, discovery, etc., etc. 
Grid Fabric Grid-enabled archives, networks, computers, display devices, etc.; associated local services Characteristics of Grid Services Architecture Identifies separation of concerns Isolates Grids from languages and specific programming environments Makes provisions for generic and application specific functionality Protocols not explicit in architecture fails to make clear distinction between language, service and networking issues Layered Grid Protocol Architecture Application User Grid Resource Connectivity Fabric Important Points Being Grid-enabled requires speaking appropriate protocols Protocol only requirement, not reachability Protocols can be used to bridge local resources or “local Grids” Built on Internet protocols Independent of language and implementation Intergrid as analog to Internet Focus on interaction over network Services exist at each level Protocols, services and interfaces Applications Languages/Frameworks User Service APIs and SDKs User Services Grid Service APIs and SDKs Grid Services Resource APIs and SDKs Resource Services User Service Protocols Grid Service Protocols Resource Service Protocols Connectivity APIs Connectivity Protocols Local Access APIs and protocols Fabric Layer How does Globus fit in? Defines connectivity and resource protocols Enables definition of grid and user protocols Globus provides some of these, others defined by other groups Defines range of APIs and SDKs that leverage Resource, Grid and User protocols Fabric Local access to logical resource May be real component, e.g. CPU, software module, filesystem May be logical component, e.g. Condor pool Protocol or API mediated Fabric elements include: SSP, ASP, peer-to-peer, Entropia-like, and enterprise level solutions Connectivity Protocols Two classes of connectivity protocols underlie all other components Internet communication Application, transport and internet layer protocols I.e., transport, routing, DNS, etc. Security Authentication and delegation Discussed below Security Protocols Services K5ssl, Globus Authorization Service APIs TLS with delegation GSS-API, GAA, SASL, gss_assist SDKs GlobusIO Resource Protocols Resource management, Storage system access Network quality of service Data movement Resource information Resource Management Protocols Resource services GRAM+GARA (on HTTP) Gatekeeper, JobManager, SlotManager APIs and SDKs GRAM API, JavaCog Client, DUROC Data Transport Protocols Services Grid FTP, LDAP for replica catalog FTP, LDAP replica catalog APIs and SDKs GridFTP client library, copy URL API, replica catalog access, replica selection Resource Information Protocol Service LDAP V3, Registration/Discovery protocol GRIS APIs & SDKs C API; JNDI, PerlLDAP, …. Grid Protocols Grid Information Index Services LDAP and Service registration protocol, … GIIS service LDAP APIs and specialized information API Co-allocation and brokering GRAM (HTTP+RSL) DUROC service DUROC client API, end-to-end reservation API Grid Protocols (cont) Online authentication, authorization services HTTP MyProxy, Group policy servers Myproxy API, GAA API, Many others (e.g.): Resource discovery (Matchmaker) Fault recovery User Protocols In general, there are many of these, they tend to be on off, and not well defined Examples: Portal toolkits (e.g. Hotpage) Netsolve Cactus framework Next Lecture Communication among distributed systems. Remote Procedure Call (RPC) References Chapter 4 of the book Univ. of Tehran Distributed Operating Systems 173