Chapter 7 Web Content Mining Xxxxxx

Introduction
• Web-content mining techniques are used to discover useful information from content on the web:
  – textual
  – audio
  – video
  – still images
  – metadata
  – hyperlinks
• Some web content is generated dynamically using queries to database management systems
• Other web content may be hidden from general users
• Problems with web data:
  – Distributed data
  – Large volume
  – Unstructured data
  – Redundant data
  – Quality of data
  – High percentage of volatile data
  – Varied data
• Two approaches to web-content mining:
  – Agent-based
    » software agents perform the content mining
  – Database-oriented
    » view the Web data as belonging to a database

Web Crawler
• A computer program that navigates the hypertext structure of the web
  – Crawlers are used to ease the formation of indexes used by search engines
  – The page(s) that the crawler begins with are called the seed URLs
• Every link from the first page is recorded and saved in a queue
• A periodic crawler builds an index by visiting a number of pages and then replaces the current index
  – It is called a periodic crawler because it is activated periodically
• Another type is the focused crawler
  – Generally recommended for use due to the large size of the Web
  – Visits pages related to topics of interest
  – If a page is not pertinent, the entire set of possible pages below it is pruned

Multiple Layered Database
• Every layer of the database is more generalized than the layer below it
• Unlike the lowest level, the upper levels are structured and can be mined with an SQL-like query language
• Provides an abstracted view of a fraction of the web
• A Virtual Web View (VWV) can be constructed

Search Engine
• Basic components of a search engine:
  – The spider gathers new or updated information on Internet websites
  – The index is used to store information about several websites
  – The search software searches through the huge index in an effort to generate an ordered list of useful search results

Types of Queries
• Boolean queries:
  – Boolean logic queries connect words in the search using operators such as AND or OR
• Natural language queries:
  – In natural language queries the user frames the query as a question or a statement
• Thesaurus queries:
  – In a thesaurus query the user selects the term from a set of terms predetermined by the retrieval system
• Fuzzy queries:
  – Fuzzy queries reflect no specificity
• Term searches:
  – The most common type of query on the Web, in which a user provides a few words or phrases for the search
• Probabilistic queries:
  – Probabilistic queries refer to the way in which the IR system retrieves documents according to relevancy

The Robot Exclusion
• Why would developers prefer to exclude robots from parts of their websites?
• The robot exclusion protocol
  – indicates restricted parts of the website to robots that visit the site
  – gives spiders (“robots”) limited access to a website
• Website administrators and content providers can limit robot activity through two mechanisms:
  – The Robots Exclusion Protocol is used by website administrators to specify which parts of the site should not be visited by a robot, by providing a file called robots.txt on their site
  – The Robots META Tag is a special HTML META tag that can be used in any Web page to indicate whether that page should be indexed or parsed for links
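As an illustration of the robots.txt mechanism described above, the following is a minimal sketch of how a crawler might check whether it is allowed to fetch a URL. It uses Python's standard urllib.robotparser module. The robots.txt contents, the host www.example.com, and the agent name ExampleBot are assumptions made up for this example, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (illustrative only): every robot is
# asked to stay out of /private/ and /tmp/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, agent: str, url: str) -> bool:
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up robot name; it falls under the "*" rules.
for url in (
    "http://www.example.com/index.html",
    "http://www.example.com/private/data.html",
):
    print(url, allowed(parser, "ExampleBot", url))
```

A well-behaved crawler would run a check like this before adding a link to its queue. The protocol is advisory: it depends on the robot choosing to honor it.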
Personalization of Web Content
• Used to modify the contents of a web page according to the needs of a user
  – Essentially, this involves building web pages exclusively for each user

Types of Web Page Personalization
• Manual techniques:
  – Perform personalization through rules that classify individuals based on profiles or demographics
• Collaborative filtering:
  – Achieves personalization by suggesting Web pages that have earlier been given high ratings by similar users
• Content-based filtering:
  – Retrieves pages based on the similarity between them and user profiles

Multimedia Information Retrieval
• Perspective of images and videos
• A content-based retrieval system for images is the Query by Image Content (QBIC) system, which uses:
  – A three-dimensional color feature vector, where the distance measure is simple Euclidean distance
  – k-dimensional color histograms, where the bins of the histogram can be chosen by a partition-based clustering algorithm
  – A three-dimensional texture vector consisting of features that measure scale, directionality, and contrast. Distance is computed as a weighted Euclidean distance, where the default weights are the inverse variances of the individual features
• The query can be expressed directly in terms of the feature representation itself
  – For instance: find images that are 40% blue in color and contain a texture with a specific coarseness property
• MIR system: www.hermitagemuseum.org/html_En/index.html
• A QBIC Layout Search Demo that gives a step-by-step demonstration of the search described in the text can be found at: www.hermitagemuseum.org/fcgibin/db2www/qbicLayout.mac/qbic?selLang=English
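The weighted Euclidean distance described for the QBIC texture vector can be sketched as follows. This is a minimal illustration, not QBIC's actual code: it assumes the usual form d(a, b) = sqrt(sum_i w_i (a_i - b_i)^2) with w_i = 1 / variance of feature i over the image collection. The feature values and image names are invented for the example.

```python
import math
from statistics import pvariance


def inverse_variance_weights(vectors):
    """One weight per feature: 1 / variance of that feature over the collection."""
    dims = len(vectors[0])
    weights = []
    for d in range(dims):
        var = pvariance([v[d] for v in vectors])
        # A constant feature cannot discriminate between images, so ignore it.
        weights.append(1.0 / var if var > 0 else 0.0)
    return weights


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical (scale, directionality, contrast) texture vectors.
textures = {
    "img_a": (0.20, 0.50, 0.10),
    "img_b": (0.40, 0.70, 0.30),
    "img_c": (0.90, 0.10, 0.80),
}

weights = inverse_variance_weights(list(textures.values()))
query = (0.25, 0.55, 0.15)

# Rank the stored images from most to least similar to the query texture.
ranking = sorted(
    textures, key=lambda name: weighted_euclidean(query, textures[name], weights)
)
print(ranking)
```

Dividing by the variance keeps a feature with a wide numeric range from dominating the distance. Setting every weight to 1 gives the simple Euclidean distance used for the color feature vector.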
Multimedia Information Retrieval
• As multimedia becomes a more widely used data format, it is vital to deal with the issues of:
  – metadata standards
  – classification
  – query matching
  – presentation
  – evaluation
• Addressing these issues is needed to guarantee the development and deployment of efficient and effective multimedia information retrieval systems