Web Intelligence

Contents
• Basic Web technology, HTML, CGI, HTTP
• XML-based standards: XSLT, XPath
• Web services, SOAP
• Computational intelligence (for instance, neural networks)
• Web crawlers and focused Web crawlers
• XML indexing/retrieval
• Ranking

The Origins of the WWW
• WWW was invented by Tim Berners-Lee at CERN (1989)
• Hypertext across the Internet (replacing FTP)
• Three constituents: HTML + URL + HTTP
• HTML is an SGML language for hypertext
• URL is a notation for locating files on servers
• HTTP is a high-level protocol for file transfers

Web Servers
[Diagram: the Web client (browser) sends an HTTP request to the Web server; the server sends back a response containing HTML code]
– Client-server model
– Stateless

Network Layers
    Our applications
    The application layer:        HTTP, FTP, SMTP, DNS
    The transport layer:          TCP, UDP
    The Internet layer:           IP
    The network interface layer:  Ethernet

HTTP
• HTTP request
      GET http://www.it.lth.se/
• HTTP response
  1. Envelope
  2. A blank line
  3. HTML code

HTTP response example
1.  HTTP/1.1 200 OK
    Date: Fri, 10 Feb 2006 13:50:53 GMT
    Server: Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3
    Content-Length: 170
    Content-Type: text/html
    Last-Modified: Fri, 10 Feb 2006 13:49:58 GMT
2.  (blank line)
3.  <html>
    <head><title>Example HTML file</title></head>
    <body>
    <h1>Anders Ardö</h1>
    He is teacher at Department of Information Technology.
    </body>
    </html>

Anatomy of a Web Page
• Head
  – Title
  – Meta: <meta name="keywords" content="HTML, WebPage">
  – Style sheets
• Body
  – Formatting tags: H1, table, B, P, BR, UL, …
  – Input forms
  – Links: <a href="http://www.it.lth.se/">IT</a>
  – Styles

Hypertext
• Collections of documents connected by hyperlinks
• Paul Otlet, philosophical treatise (1934)
• Vannevar Bush, hypothetical Memex system (1945)
• Ted Nelson introduced hypertext (1968)
• Hypermedia generalizes hypertext beyond text

Markup Languages
• Notation for adding formal structure to text
• Charles Goldfarb, the INLINE system (1970)
• Standard Generalized Markup Language, SGML (1986)

The Design of HTML
• Simple, purist design principles
• HTML describes the logical structure of a document
• Browsers are free to interpret tags differently
• HTML is a lightweight file format
• Size of a file containing just "Hello World!":
      PostScript   11,274 bytes
      PDF           4,915 bytes
      MS Word      19,456 bytes
      HTML             28 bytes

Simple Formatting (1/2)
    <html>
      <head>
        <title>Good Advice</title>
      </head>
      <body>
        <h1>Good Advice for Everyday Life</h1>
        <h2>For UNIX programmers</h2>
        <b>Never</b> type:
        <p><tt>rm -rf /*</tt><p>
        on your computer.
        <h2>For Nuclear Scientists</h2>
        <b>Never</b> press the
        <i>Big <font color="red">Red</font> Button</i>.
      </body>
    </html>

Simple Formatting (2/2)
[Image not preserved]

Hyperlinks: Source Document
    <html>
      <head>
        <title>Source Document</title>
      </head>
      <body>
        <a href="target.html#danger">Better look here</a>.
      </body>
    </html>

Hyperlinks: Target Document
    <html>
      <head>
        <title>Target Document</title>
      </head>
      <body>
        ...
        <a name="danger"></a>
        <h2>Chapter 17: Dangerous Shell Commands</h2>
        Never execute a shell command that inadvertently
        changes all vowels to the character 'x'.
      </body>
    </html>

HTML Validity
• HTML has a formal syntax specification
• 800 lines of DTD notation
• A validator gives syntax errors for invalid documents
• Most HTML documents on the Web are invalid:
      www.microsoft.com   123 errors
      www.cnn.com          58 errors
      www.ibm.com          30 errors
      www.google.com       27 errors
      www.sun.com          19 errors
• Valid documents may contain this logo: [logo image not preserved]

Reasons for Invalidity
• Ignorance of the HTML standard
• Lack of testing
  – "This page is optimized for the XYZ browser"
  – "This page is best viewed in 1024x768"
• Automatic tools generate invalid HTML output
• Forgiving browsers try to interpret invalid input

    <h2>Lousy HTML</h1>
    <li><a>This is not very</b> good.
    <li><i>In fact, it is quite bad</em>
    </ul>
    But the browser does <a naem="goof">something.

Problems with Invalidity
• There are several different browsers
• Each browser has many different implementations
• Each implementation must interpret invalid HTML
• There are many arbitrary choices to make
• The HTML standard has been undermined
• HTML renders differently for most clients

HTTP requests
• GET:
      GET /path/to/file/index.html HTTP/1.0
• HEAD:
      HEAD /path/to/file/index.html HTTP/1.0
• POST: adds data in the message body
• and others …

HTTP example
    GET /search?q=Introduction+to+XML+and+Web+Technologies HTTP/1.1
    Host: www.google.com
    User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2) Gecko/20040803
    Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    Accept-Language: da,en-us;q=0.8,en;q=0.5,sw;q=0.3
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive
    Referer: http://www.google.com/

Parts of the request:
• Request line (methods: GET, POST, ...)
• Header lines
• Request body (empty here)

HTTP Responses
Status line:
    HTTP/1.1 200 OK
Header lines:
    Connection: close
    Date: Thu, 16 Mar 2006 12:39:12 GMT
    Accept-Ranges: bytes
    ETag: "63062-0-41342c03"
    Server: Apache/1.3.29 (Debian GNU/Linux) PHP/4.3.3
    Content-Length: 2820
    Content-Type: text/html
    Last-Modified: Tue, 31 Aug 2004 07:42:59 GMT
    Client-Date: Thu, 16 Mar 2006 12:39:12 GMT
    Client-Peer: 130.235.4.69:80
    Client-Response-Num: 1
Response body:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>...</html>

HTTP return codes
• 1xx informational message
• 2xx success
  – 200 OK
• 3xx redirect
  – 301 Moved Permanently
• 4xx client error
  – 400 Bad Request
  – 401 Unauthorized
  – 403 Forbidden
  – 404 Not Found
• 5xx server error
  – 500 Internal Server Error
  – 503 Service Unavailable

Static vs Dynamic Pages
• Static - just copy a file from server to client
• Dynamic - do some data processing
• Parameters - CGI, Forms

Dynamic Web Pages
• Answers to database queries
• Animated Web pages
• User dialogs
• Checking user input
May be handled client side (JavaScript, Java applets, Flash, …) or server side

Dynamic, server side
• CGI – Perl, Python, C, …
• ASP
• PHP
• Java Servlets
• Java Server Pages - JSP
• etc.

CGI - Common Gateway Interface
• The Web server gets a request for a page with a special URL (/cgi-bin/…)
• The CGI script is started as an OS process
• The script reads its parameters
• The script outputs HTML code
• The script process terminates

CGI problems
• OS processes are expensive
• State between invocations
• Synchronization between processes

Parameters
• HTML forms
    <h3>Search Lund University Departments</h3>
    <form action="http://www.lu.se/search.phtml" method="get">
      Which database?
      <select name="db">
        <option value="LTH">LTH</option>
        <option selected value="LU">All LU</option>
        <option value="IT">IT</option>
      </select><br>
      Please enter your question:
      <input type="text" name="query"><br>
      <input type="submit" name="send" value="Go!">
    </form>

HTML form parameters
• Encoded in the URL:
  – GET
        GET /cgi-bin/search.phtml?db=LU&query=masters+thesis HTTP/1.0
• Encoded in the message body:
  – POST
        POST /cgi-bin/search.phtml HTTP/1.0
        Content-Type: application/x-www-form-urlencoded
        Content-Length: 26

        db=LU&query=masters+thesis

Encoding of Form Data
    Name    Value
    db      LU
    query   masters thesis
    send    Go!
• Encoding to query string (URL encoding):
      db=LU&query=masters+thesis&send=Go%21
• GET: place the query string in the request URL
      http://.../search.phtml?db=LU&query=mast...
• POST: place the query string in the request body

Server side scripting: PHP
• General-purpose scripting language
• Suited for Web development
• Can be embedded into HTML
• Has a lot of predefined modules and interfaces

PHP example
    <html>
      <head>
        <title>PHP Test</title>
      </head>
      <body>
        <?php echo "<p>Hello World</p>\n"; ?>
        The time is <?php echo date('H:i:s'); ?>
      </body>
    </html>

Uniform Resource Locator
• A Web resource is located by a URL
      http://www.w3.org/TR/html4/
      (scheme: http; server: www.w3.org; path: /TR/html4/)
• Relative URL
      sgml/dtd.html
• Fragment identifier
      http://www.w3.org/TR/HTML4/#minitoc

URIs, URNs
• Uniform Resource Identifier (URI)
      scheme:scheme-specific-part
  Conventions about use of /, #, and ?
• Uniform Resource Name (URN)
      urn:isbn:0-471-94128-X

Sessions
• But what if I'd like to implement a hit counter?
• Stateless => problems

Session Management Techniques
– URL rewriting
– Hidden form fields
– Cookies
– SSL sessions

Cookies
• Extension of HTTP that allows servers to store data on the clients
  – limited size and number
  – may be disabled by the client
• Set-Cookie: sessionid=21A9A8089C305319; path=/
• Cookie: sessionid=21A9A8089C305319

Regular expressions
• A very powerful way of extracting information (pieces of text) from a large document
• A regular expression describes a pattern that is matched against the text

Regular expressions
• /Heja/ matches the string 'Heja'
• /Heja?/ matches the strings 'Hej' and 'Heja'
• /^http:/ matches all lines that begin with 'http:'
• /\bFred\b/ matches 'Fred' but not 'Fredrick'
• /(\d+):(\d+):(\d+)/ matches, for example, times like 12:30:01 and captures hours in group 1, minutes in group 2, and seconds in group 3
• /http:\/\/([^\/]+)(\/[^\s]+)\s/ matches URLs and places the server in group 1 and the path in group 2

Regular expressions
How to match and extract ISBN numbers?
• What is an ISBN number?
• Format?
• /isbn:?\s*([\d-x]+)/i
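
The patterns on the regular-expression slides can be tried out directly. What follows is a minimal sketch in Python, which is an assumption: the slides use Perl-style /.../ notation and do not name a language. The helper names extract_time, extract_url and extract_isbns are hypothetical. One adjustment is needed for the ISBN pattern: Python's re module rejects the character class [\d-x] as a bad range, so the sketch writes it as [\dx-], which should keep the intended meaning (digits, 'x', and '-').

```python
import re

# Time pattern from the slides: hours, minutes, seconds in groups 1-3.
TIME_RE = re.compile(r"(\d+):(\d+):(\d+)")

# URL pattern from the slides: server in group 1, path in group 2.
# In Python the '/' needs no escaping, because there are no /.../ delimiters.
URL_RE = re.compile(r"http://([^/]+)(/[^\s]+)\s")

# ISBN pattern from the slides, with the hyphen moved to the end of the
# character class (assumption: same intent as the Perl-style [\d-x]).
ISBN_RE = re.compile(r"isbn:?\s*([\dx-]+)", re.IGNORECASE)


def extract_time(text):
    """Return (hours, minutes, seconds) as strings, or None if no match."""
    match = TIME_RE.search(text)
    return match.groups() if match else None


def extract_url(text):
    """Return (server, path) for the first URL followed by whitespace."""
    match = URL_RE.search(text)
    return match.groups() if match else None


def extract_isbns(text):
    """Return every ISBN-like string that follows the word 'isbn'."""
    return ISBN_RE.findall(text)


if __name__ == "__main__":
    print(extract_time("Lecture starts at 12:30:01 sharp"))
    print(extract_url("See http://www.w3.org/TR/html4/ for details"))
    print(extract_isbns("urn:isbn:0-471-94128-X"))
```

Note that the URL pattern only matches when whitespace follows the URL, exactly as on the slide, so a URL at the very end of a string is not found. The ISBN pattern only checks the character set, not the length or the check digit.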