COS 109
Monday November 23

• Housekeeping
  – Lab 6 and Problem Set 7 due dates
      Lab 6 is due by midnight on Friday November 27
      Problem Set 7 is due by 5 PM on Monday November 30
  – Because these deadlines have been extended, there will be no further extensions
  – Final exam – January 18 (Monday) at 7:30 PM
• Today’s class
  – A few more words about the internet
  – The World Wide Web

Grades on Problem Set 6
[Figure: histogram of Problem Set 6 scores]
Average score 35.8; a few people did not complete the assignment

The geography of the internet
• In 2012, there were 903.9 million Internet hosts
  – USA           505M (498M in 2011)
  – Japan         64.5M
  – Brazil        26.6M
  – Italy         25.7M
  – China         20.6M
  – Germany       20.0M
  – …
  – Iraq          26
  – Guam          23
  – North Korea   8
  – Chad          6
Source: CIA Factbook

Internet Users Worldwide
• Internet Users (2014 est.)
  – China            626M
  – European Union   398M
  – USA              276.6M
  – India            237.3M
  – Japan            109.3M
  – Brazil           108.2M
  – Russia           84.4M
  – Germany          70.3M
  – Nigeria          66.6M
  – Total worldwide  3.2B

The backbone of the internet
• http://upload.wikimedia.org/wikipedia/commons/d/d2/Internet_map_1024.jpg
• http://internet-map.net/

Let's register an internet domain
• http://www.directnic.com

Who manages this?
• Internet Corp. for Assigned Names and Numbers (ICANN)
  – formed in October 1998
  – non-profit, private-sector corporation
  – broad coalition of the Internet's business, technical, academic, and user communities
  – recognized by the U.S. and other governments as the global consensus entity to coordinate the technical management of the Internet's domain name system, the allocation of IP address space, the assignment of protocol parameters, and the management of the root server system
  – funded through the many registries and registrars that comprise the global domain name and Internet addressing systems

ICANN was formed in 1998.
It is a not-for-profit public-benefit corporation with participants from all over the world dedicated to keeping the Internet secure, stable and interoperable. It promotes competition and develops policy on the Internet’s unique identifiers.*
ICANN doesn’t control content on the Internet. It cannot stop spam and it doesn’t deal with access to the Internet. But through its coordination role of the Internet’s naming system, it does have an important impact on the expansion and evolution of the Internet.*
* From http://www.icann.org/en/about/

What does ICANN govern
• DNS – domain name system
  – Relates names to numbers
• TLD – top level domains
  – Originally there were 7: .com, .edu, .gov, .int, .mil, .net, .org
  – 200+ country code top level domains
  – 1000+ gTLD (generic top level domains)
  – .academy, .accountant, .apartments, .biz, .black, .cool, .dad, .money, .ooo, .sucks, .vodka, .xxx, .zone
  – More are here
• Management
  – One company (called a registry) is in charge of each TLD.
  – A large number of companies (called registrars) can sell (and manage) names within a TLD

How does ICANN govern
• Draws up contracts with each registry
• Runs an accreditation system for registrars
• Oversees IP addresses (through companies)
• Oversees root servers
  – Root servers are 13 addresses on the Internet where complete address tables can be found

What about the root servers?
• What do they do?
  – Ultimately resolve addresses
      With help from top level domains
      Example: cs.princeton.edu
        .edu TLD to find princeton
        princeton.edu to find cs.princeton.edu
  – But things change slowly, so
      There are intermediate name servers which cache addresses
      Very few address queries actually come to a root server.

List of root servers
Hostname            IP Addresses                        Manager
a.root-servers.net  198.41.0.4, 2001:503:ba3e::2:30     VeriSign, Inc.
b.root-servers.net  192.228.79.201, 2001:500:84::b      University of Southern California (ISI)
c.root-servers.net  192.33.4.12, 2001:500:2::c          Cogent Communications
d.root-servers.net  199.7.91.13, 2001:500:2d::d         University of Maryland
e.root-servers.net  192.203.230.10                      NASA (Ames Research Center)
f.root-servers.net  192.5.5.241, 2001:500:2f::f         Internet Systems Consortium, Inc.
g.root-servers.net  192.112.36.4                        US Department of Defense (NIC)
h.root-servers.net  128.63.2.53, 2001:500:1::803f:235   US Army (Research Lab)
i.root-servers.net  192.36.148.17, 2001:7fe::53         Netnod
j.root-servers.net  192.58.128.30, 2001:503:c27::2:30   VeriSign, Inc.
k.root-servers.net  193.0.14.129, 2001:7fd::1           RIPE NCC
l.root-servers.net  199.7.83.42, 2001:500:3::42         ICANN
m.root-servers.net  202.12.27.33, 2001:dc3::35          WIDE Project

Root servers
• Some are fixed in location (unicast)
• Others are distributed (anycast)
  – Queries are routed to the topologically closest of a group of receivers all identified by the same destination address.
  – So, a decentralized service is provided.
  – Anycast servers can spread the load of a distributed denial of service (DDoS) attack across many sites and so reduce its impact.

And where are they?
Details at http://www.root-servers.org/

Peering points
• There are several hundred such points
• Largest is Deutscher Commercial Internet Exchange
  – 650+ members
  – a peak speed of 5000 Gbit/sec (average speed 3000 Gbit/sec) of connected capacity
  – an average throughput of 1061 Gbit/sec
  – Quick Facts: 100% uptime since 1997

Summarizing internet ideas
• packets versus circuits
  – different models (mail vs phone)
• names and addresses
  – what is a computer called, how to find it
• routing
  – how to get from here to there
• protocols and standards
  – Internet works because of IP as common mechanism
      higher level protocols all use IP
      specific hardware technologies carry IP packets
• layering
  – divide system into layers, each of which provides services to the next higher level while calling on services of the next lower level
  – a way to organize and control complexity, hide details

Summarizing internet technical issues:
• privacy & security are hard
  – data passes through shared, unregulated, dispersed media and sites scattered over the whole world
  – it's hard to control access & protect information along the way
  – many network technologies (e.g., Ethernet, wireless) use broadcast
      encryption necessary to maintain privacy
  – many mechanisms are not robust against intentional misuse
  – it's easy to lie about who you are
• service guarantees are hard
  – no assurance of reliable delivery, let alone of bandwidth, delay or jitter
• some resources are running low
  – IPv4 addresses are pretty much all assigned
  – IPv6 (the next generation) uses 128-bit addresses
      acceptance growing, by necessity
• but it has handled exponential growth amazingly well

To summarize
• How the internet works
• And now that we’ve reached the end of the internet

Website of the day
• google trends

Moving above internet pipes -- information flows to apps

Higher level protocols
• SSH: secure login
• SMTP: mail transfer
• HTTP: hypertext transfer -> Web
• protocol layering:
  – a single protocol can't do everything
  – higher-level protocols build elaborate operations out of simpler ones
  – each layer uses only the services of the one directly below
  – and provides the services expected by the layer above
  – all communication is between peer levels: layer N destination receives exactly the object sent by layer N source
[Figure: layer stack, top to bottom]
      application
      reliable transport service
      connectionless packet delivery service
      physical layer

Encapsulation
• each piece of data at one level is wrapped up with a header and sent as a packet at the next lower level
• lowest level is what moves across specific network
[Figure: data wrapped successively in HTTP, TCP, IP, and Ethernet headers]

One particular app – the (World Wide) Web
• a way to connect computers that provide information (servers) with computers that ask for it (clients like you and me)
  – uses the Internet, but it's not the same as the Internet
• URL (uniform resource locator, e.g., http://www.amazon.com)
  – a way to specify what information to find, and where
• HTTP (hypertext transfer protocol)
  – a way to request specific information from a server and get it back
• HTML (hypertext markup language)
  – a language for describing information for display
• browser (Firefox, Safari, Internet Explorer, Opera, Chrome, …)
  – a program for making requests, and displaying results
• embellishments
  – pictures, sounds, movies, ...
  – loadable software
• the set of everything this provides

Web history
• 1989: Tim Berners-Lee at CERN
  – a way to make physics literature and research results accessible on the Internet
• 1991: first software distributions
• Feb 1993: Mosaic browser
  – Marc Andreessen at NCSA (Univ of Illinois)
• Mar 1994: Netscape
  – first commercial browser
• technical evolution managed by World Wide Web Consortium
  – non-profit organization at MIT, Berners-Lee is director
  – official definition of HTML and other web specifications
  – see www.w3.org

HTTP: Hypertext transfer protocol
• What happens when you click on a URL?
• client opens TCP/IP connection to host, sends request
      GET /filename HTTP/1.0
• server returns
  – header info
  – HTML
[Figure: client sends "GET url" to server; server returns HTML to client]
• since server returns the text, it can be created as needed
  – can contain encoded material of many different types (MIME)
• URL format
      service://hostname/filename?other_stuff
• filename?other_stuff part can encode
  – data values from client (forms)
  – request to run a program on server (cgi-bin)
  – anything else

Embellishments
• original design of HTTP just returns text to be displayed
• now includes pictures, sound, video, ...
  – need helpers or plug-ins to display non-text content
      e.g., GIF, JPEG graphics; sound; movies
• forms filled in by user
  – need a program on the server to interpret the information (cgi-bin)
• cookies to remember information on client
  – HTTP is stateless: server doesn't save anything from one request to next
  – cookies are a way to remember information at the client
• active content: download code to run on the client
  – Javascript
  – Java applets
  – plug-ins
  – ActiveX

Forms and CGI programs
• "common gateway interface"
  – standard way to request the server to run a program
  – using information provided by the client via a form
• if the target file on server is an executable program
• and it has the right properties and permissions
  – e.g., in /cgi-bin directory and executable
• then run it on server to produce HTML to send back to client
  – using the contents of the form as input
  – output depends on client request: created on the fly, not just a file
• CGI programs can be written in any programming language
  – Perl, Python, PHP, Java, Ruby, …

Example form in HTML (dpd.mycpanel2.princeton.edu/mailform.html)
<html>
<body>
<form METHOD="post" ACTION="http://dpd.mycpanel2.princeton.edu/zcgi-bin/mailform.cgi">
  <input type="hidden" name="email" value="cos109@princeton.edu">
  Your name: <input type="text" name="name"><p>
  Your email: <input type="text" name="address"><p>
  Please rate this page:<p>
  <input type=radio name=rate value=poor> Poor
  <input type=radio name=rate value=ok> OK
  <input type=radio name=rate value=good> Good
  <p>
  <input type="submit"> <input type="reset">
</form>
</body>
</html>

Cookies
• HTTP is stateless: doesn't remember from one request to next
• cookies intended to deal with stateless nature of HTTP
  – remember preferences, manage "shopping cart", etc.
• cookie: one chunk of text sent by server to be stored on client
  – stored in browser while it is running (transient)
  – stored in client file system when browser terminates (persistent)
• when client reconnects to same domain, browser sends the cookie back to the server
  – sent back verbatim; nothing added
  – sent back only to the same domain that sent it originally
  – contains no information that didn't originate with the server
• in principle, pretty benign
• but heavily used to monitor browsing habits, for commercial purposes

Cookie crumbs
• fetch a page from xyz.com
  – it contains <img src=http://doubleclick.com/advt.gif>
  – this causes a page to be fetched from DoubleClick.com
  – which now knows your IP address and what page you were looking at
• DoubleClick sends back a suitable advertisement
  – with a cookie that identifies "you" at DoubleClick
• next time you fetch any page that contains a DoubleClick.com image
  – the last DoubleClick cookie is sent back to DoubleClick
  – the set of sites and images that you are viewing is used to
      - update the record of where you have been and what you have looked at
      - send back targeted advertising (and a new cookie)

Advertising marketplace
• advertising exchanges
  – Yahoo Right Media, Doubleclick Ad Exchange, Facebook Atlas ...
• a person uses a browser to request a web page
• web page "publisher" notifies exchange that advertising space on that page is available
  – publishers are typically portals or entertainment and news sites
  – publisher provides information about the person: past online activity, viewing and shopping habits, geographic location, demographics
      probably not actual identity (?)
• advertisers bid on the ad space
  – amount depends on person's attributes and location, advertiser's budget, etc.
• winner's advertisement is inserted into the page
• elapsed time: 10-100 milliseconds
• this happens for multiple advertisements on one page
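
The bidding steps above can be sketched in a few lines of Python. This is only an illustration, not how any real exchange works: the class names, the fields describing the visitor, the doubling rule for matching attributes, and the "highest bid wins and pays its own bid" rule are all assumptions made for the example. The slide says only that advertisers bid and that the amount depends on the person's attributes, location, and the advertiser's budget.

```python
from dataclasses import dataclass


@dataclass
class Visitor:
    """What the publisher tells the exchange about the person (hypothetical fields)."""
    location: str
    interests: frozenset


@dataclass
class Advertiser:
    """A bidder with a simple, made-up bidding rule."""
    name: str
    budget: float          # money left to spend
    base_bid: float        # bid for a visitor it knows nothing about
    target_location: str   # bids more for visitors in this location
    target_interest: str   # bids more for visitors with this interest

    def bid(self, visitor: Visitor) -> float:
        # Assumed rule: double the bid for each attribute that matches,
        # but never bid more than the remaining budget.
        amount = self.base_bid
        if visitor.location == self.target_location:
            amount *= 2
        if self.target_interest in visitor.interests:
            amount *= 2
        return min(amount, self.budget)


def run_auction(visitor, advertisers):
    """Pick a winner for one ad slot; return (winner, price) or None."""
    bids = [(a.bid(visitor), a) for a in advertisers]
    bids = [(amount, a) for amount, a in bids if amount > 0]
    if not bids:
        return None
    # Assumption: highest bid wins and pays exactly what it bid.
    price, winner = max(bids, key=lambda pair: pair[0])
    winner.budget -= price
    return winner, price
```

A page with several ad slots would call run_auction once per slot, which matches the last bullet: the whole exchange repeats for each advertisement on the page.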