8. Distributed Web-based Systems Sistemes Distribuïts en Xarxa (SDX) Facultat d’Informàtica de Barcelona (FIB) Universitat Politècnica de Catalunya (UPC) 2013/2014 Q2 Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (2) Distributed Web-based systems Introduction • The World Wide Web is a huge distributed system with millions of clients and servers for accessing hyperlinked documents • Key issues – – – – – – – Architecture Communication Naming Synchronization Consistency & replication Fault tolerance Security (not covered in this course) SDX (3) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (4) Distributed Web-based systems Traditional Web-based systems • Client-server architecture – Server: store documents, run remote commands – Browser: retrieve and display documents SDX (5) Distributed Web-based systems Traditional Web-based systems • Documents can be: – Plain text – HTML: HyperText Markup Language • Most common • Allows embedding of links to other documents – XML: Extensible Markup Language • More flexible (i.e. it contains the names, types and structure of the data elements within it) – Images, Audio, Video – Application (e.g. PDF, PS) • Can include scripts that execute in the client SDX (6) Distributed Web-based systems Multitiered architectures • Common Gateway Interface (CGI) – Server runs a program taking user data as input – This has now evolved to servlets and JSPs • This has lead to three-tiered architectures SDX (7) Distributed Web-based systems Three-tiered architecture • Distribute logical layers across the tiers – User-interface, processing, and data layer SDX (8) Distributed Web-based systems Web server clusters • Web servers can be clustered transparently to clients to improve performance & availability SDX (9) Distributed Web-based systems Web server clusters • Problem: The front end may easily get overloaded, thus must be carefully designed 1. Transport-layer switch • Front end simply passes the data sent along the TCP connection to one of the servers, taking some measurement on servers load into account 2. Content-aware request distribution – Front end inspects the content of the HTTP request and then selects the best server ↑ Flexible, better profit of caching/replication ↓ High demand on front end SDX (10) Distributed Web-based systems READING REPORT • Barroso09. The Datacenter as a Computer: Workloads and Software Infrastructure SDX (11) Distributed Web-based systems Web Services • Go beyond simple user-site interaction and offer services to remote applications – Communication using Internet standards • Web Services components – A standard way for communication (SOAP) – A uniform data representation and exchange mechanism (XML) – A standard meta language to describe the services offered (WSDL) – A mechanism to register and locate WS-based applications (UDDI) SDX (12) Distributed Web-based systems Web Services SDX (13) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (14) Distributed Web-based systems HyperText Transfer Protocol (HTTP) • Simple client-server transfer protocol for communication in traditional web systems • One resource per HTTP request • Based on TCP • HTTP v1.0 uses non-persistent connections – Each request to a server requires setting up a separate TCP connection SDX (15) Distributed Web-based systems HyperText Transfer Protocol (HTTP) • HTTP v1.1 uses persistent connections – Same TCP connection can be used to issue several requests • Pipelining: Client can also issue several requests in a row without waiting the response for the first SDX (16) Distributed Web-based systems HyperText Transfer Protocol (HTTP) • Operations supported by HTTP Operation Description Head Request to return the header of a resource Get Request to return a resource to the client. Can be a document or the output of a program execution Put Request to store data as a resource Post Provide data that are to be handled by a resource Delete Request to delete a resource SDX (17) Distributed Web-based systems HyperText Transfer Protocol (HTTP) • HTTP request message – Client message headers: e.g. acceptable content types e.g. acceptable document encoding e.g. list of client’s credentials e.g. conditions on the latest date of modification of the resource SDX (18) Distributed Web-based systems HyperText Transfer Protocol (HTTP) • HTTP response message – Server message header: e.g. time of last modification e.g. response’s expiration time e.g. URL to redirect request SDX (19) Distributed Web-based systems Simple Object Access Protocol • SOAP is the standard protocol for communication between Web services • Based on XML (extends XML-RPC) ↑ XML allows self-describing data (portable) ↓ Introduces performance problems ↓ Not meant to be read by human beings • SOAP is platform independent, but bound to an underlying protocol (a.k.a. carrier) – Specifies bindings to underlying transfer protocols • Currently HTTP, SMTP • Recipient’s address is specified by the transfer protocol SDX (20) Distributed Web-based systems Simple Object Access Protocol • SOAP message consists of two elements which are jointly put inside a Envelope 1. Header (optional) – Contains application specific information relevant for nodes along the path from sender to receiver • e.g. routing, security, transactions 2. Body (mandatory) – Contains the actual message SDX (21) Distributed Web-based systems SOAP example: Request <soap:Envelope URI of XML schema for SOAP envelopes xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"> <soap:Header> <m:Trans xmlns:m="http://www.w3schools.com/transaction/" soap:mustUnderstand="1">234</m:Trans> </soap:Header> <soap:Body> URI of XML schema for the service description <m:GetPrice xmlns:m="http://www.w3schools.com/prices"> <m:Item>Apples</m:Item> </m:GetPrice> </soap:Body> </soap:Envelope> SDX (22) Distributed Web-based systems SOAP example: Response <soap:Envelope URI of XML schema for SOAP envelopes xmlns:soap="http://www.w3.org/2001/12/soap-envelope" soap:encodingStyle="http://www.w3.org/2001/12/soap-encoding"> <soap:Header> <m:Trans xmlns:m="http://www.w3schools.com/transaction/" soap:mustUnderstand="1">234</m:Trans> </soap:Header> URI of XML schema for the service description <soap:Body> <m:GetPriceResponse xmlns:m="http://www.w3schools.com/prices"> <m:Price>1.90</m:Price> </m:GetPriceResponse> </soap:Body> </soap:Envelope> SDX (23) Distributed Web-based systems Web Services Description Language • WSDL is a formal language for describing precisely the interfaces provided by a WS – Procedure specification, data types, logical location of services • Based on XML • Can be automatically translated to client and server stubs • Analogous to IDL in RPCs SDX (24) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (25) Distributed Web-based systems Naming • Documents referred by URIs • Two forms of Uniform Resource Identifiers – URN: Uniform Resource Name • Globally unique, location independent and persistent reference to a document – A resolution operation is needed to find it • e.g. urn:ietf:rfc:3187 – URL: Uniform Resource Locator • Includes information on how and where to access the document • e.g. http://tools.ietf.org/html/rfc3187.html SDX (26) Distributed Web-based systems Naming • URL contents – Scheme: Application-level protocol for transferring the document (e.g. http, ftp) – DNS name/IP address of server – Pathname of the resource in server’s file system – Input query to the resource – Fragment to identify a component of the resource http:// servername [:port] [/pathname] [?query] [#fragment] SDX (27) Distributed Web-based systems Naming http://www.cs.vu.nl/home/steen/mbox http://www.cs.vu.nl:80/home/steen/mbox http://130.37.24.11:80/home/steen/mbox http://www.cdk5.net http://www.w3.org/standards/faq.html#intro http://www.google.com/search?q=obama scheme server name port pathname http www.cs.vu.nl (default) /home/steen/mbox (none) (none) http www.cs.vu.nl 80 /home/steen/mbox (none) (none) http 130.37.24.11 80 /home/steen/mbox (none) (none) http www.cdk5.net (default) (default) (none) (none) http www.w3.org (default) standards/faq.html (none) intro http www.google.com (default) search q=obama (none) SDX (28) Distributed Web-based systems query fragment Naming • Examples of URIs Name Used for Example http HTTP http://www.cs.vu.nl:80/globe mailto E-mail mailto:steen@cs.vu.nl ftp FTP ftp://ftp.cs.vu.nl/pub/minix/README file Local file file:/edu/book/work/chp/11/11 data Inline data data:text/plain;charset=iso-88597,%e1%e2%e3 telnet Remote login telnet://flits.cs.vu.nl tel Telephone tel:+31201234567 Modem Modem modem:+31201234567;type=v32 SDX (29) Distributed Web-based systems Web Services naming • UDDI directory service – Universal Description Discovery Interface • Contains service descriptions for allowing users browsing for relevant services • Provides pointers to WSDL documents • WSDL service descriptions may be looked up by name or by attribute • UDDI is also based on XML SDX (30) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (31) Distributed Web-based systems Synchronization • Not an issue in traditional web-based systems – Nothing to synchronize since servers do not exchange information with other servers – WWW is a read-mostly system. Updates are done by a single entity • But today, distributed authoring of documents is emerging – Synchronization is needed – Let’s see how this is accomplished in collaborative editing of documents: Google Docs • Based on Operational Transformation (OT) SDX (32) Distributed Web-based systems Google Docs collaboration • A document is stored in the server as a list of chronological changes: revision log – 3 basic types of changes: inserting text, deleting text, and applying styles to a range of text • e.g. {InsertText ‘SDX' @10}, {DeleteText @9-11}, {ApplyStyle bold @10-20} – Append each change to the end of the revision log A. Collaborative protocol to sync changes 1. Each editor sends changes to the server and waits for acknowledgement • Changes during this period are keep in a pending list (never send more than one change at a time) SDX (33) Distributed Web-based systems Google Docs collaboration 2. For each change, server updates revision log, acknowledges a new revision to the sender and sends change to the other editors 3. Each editor transforms incoming changes against its pending changes so that they make sense relative to the local version of the document • Operational Transformation to define the different ways that InsertText, DeleteText, and ApplyStyle changes can be paired and transformed against each other – http://googledocs.blogspot.com.es/2010/09/whatsdifferent-about-new-google-docs_22.html – http://googledocs.blogspot.com.es/2010/09/whats-differentabout-new-google-docs_23.html SDX (34) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (35) Distributed Web-based systems Client-side caching 1. Browser’s cache 2. Web proxy caching – Site installs a separate proxy server that passes all requests from local clients to the Web servers – Proxy subsequently caches incoming documents a) Hierarchical caches • • Place caches covering a region (even a country) On cache miss, the request is moved upward b) Cooperative caching • On cache miss, the web proxy checks the neighboring proxies SDX (36) Distributed Web-based systems Cooperative caching SDX (37) Distributed Web-based systems Cache-consistency protocols 1. Verify validity by contacting web server ↑ Strong consistency ↓ Server must be contacted for each request 2. Age-based consistency – Used by Squid Web proxy – Assign an expiration time to each document • Depends on how long ago the document was last modified when it is cached – Document is considered valid until this time ↑ Better performance ↓ Weaker consistency SDX (38) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication – Content Distribution Networks (CDN) • Fault tolerance SDX (39) Distributed Web-based systems Content Distribution Networks • Goal: Improve content delivery performance & reliability without tremendous infrastructure investments by the content providers • Idea: distributed web server – CDN replicates contents across the Internet and delivers them to users on behalf of origin sites • Typically high demand content e.g. graphics, media, etc. Reduces the load on origin servers • Can handle better flash crowds Reduces client perceived response time • Bring content closer to end-users SDX (40) Distributed Web-based systems Content Distribution Networks • Using replication introduces some technical challenges: – What content to replicate? – Where to place replicated content? – How to enforce consistency between replicas? How to route a client request to a replica? • How to find the requested content? • How to choose between different replicas of the content? Redirection schemes SDX (41) Distributed Web-based systems Redirection schemes • URL rewriting – Origin server rewrites object URL links in returned pages to redirect clients to different CDN servers • DNS redirection – Origin server modifies its DNS zone file; DNS resolution is now controlled by CDN – Redirect requests to CDN servers according to a given policy (least loaded server, closest server to client, etc.) – Full-site or partial-site content delivery • Hybrid schemes can be also used SDX (42) Distributed Web-based systems URL rewriting example index.html <HTML> <BODY> <IMG SRC=“www.akamai1.net/www.yahoo.com/img1.gif”> <IMG SRC=“www.akamai2.net/www.yahoo.com/img2.gif”> <IMG SRC=“www.akamai3.net/www.yahoo.com/img3.gif”> </BODY> </HTML> SDX (43) Distributed Web-based systems DNS redirection example Origin server 111.222.100.1 www.yahoo.com/index.html GET index.html 10.20.30.1 <HTML> 10.20.30.1 (no 111.222.100.1) yahoo.com IP? 10.20.30.4 10.20.30.2 10.20.30.3 SDX (44) Distributed Web-based systems DNS server controlled by CDN Example: Akamai • World leader in CDN – 100,000 servers in 75 countries directly connected within nearly 1,000 networks • Uses a hybrid redirection scheme – DNS redirection – URL rewriting • Akamai Resource Locator (ARL): ‘Akamized’ URL http://www.foo.com/images/logo.gif http://a9.g.akamai.net/7/9/21/aaa7a80f016a2c/www.foo.com/images/logo.gif Serial number Akamai domain Type code Content provider code Object data Original URL SDX (45) Distributed Web-based systems Example: Akamai 1. DNS lookup for www.xyz.com 2. IP address for optimal Akamai server returned 3. Browser requests HTML from optimal Akamai server 4. Akamai server contacts central xyz.com server over the Internet if necessary 5. Akamai server assembles HTML and returns to browser 6. Browser resolves embedded links, obtaining IP addresses for optimal Akamai servers hosting those objects 7. Browser retrieves objects – e.g. dynamic content SDX (46) Distributed Web-based systems Contents • Introduction • Architecture • Communication • Naming • Synchronization • Consistency & replication • Fault tolerance SDX (47) Distributed Web-based systems Fault tolerance • Mainly achieved through client-side caching and server replication SDX (48) Distributed Web-based systems Summary • WWW as a distributed system – From traditional client-server architectures for fetching hyperlinked documents to Web Services – Use of standardized protocols • Communication: HTTP, SOAP • Naming: URI, UDDI – Caching and replication for improving performance and fault tolerance • e.g. web proxy caching, CDN SDX (49) Distributed Web-based systems