Berardino: Web Mining 1

Bailey Berardino
February 19, 2012
ILS 655: Digital Libraries
DL Technology Exploration

Web Mining

“The web is perhaps the single largest data source in the world” (Liu, 2005, p. 2). The World Wide Web is full of data, or raw information, available for use. Web mining seeks to extract that information and gain useful knowledge from it. There are many different definitions of web mining. Cooley and Mobasher (1997) defined it as “the discovery and analysis of useful information from the World Wide Web” (p. 1). Kosala and Blockeel state that web mining “refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data” (2000, p. 3).

So why mine data from the web? As stated above, the Web holds an unprecedented amount of data. According to Liu (2005), the size of the Web leads to both opportunities and challenges. Information on the Web is easily accessed, diverse, and covers almost any topic. It comes in all forms and types, and much of the Web is linked together. Much of the Web is also redundant. The Web has different levels (Surface and Deep), and it is dynamic, always changing. Lastly, it is a “virtual society,” as much about interactions as anything else (Liu, 2005). Web mining seeks to extract the useful information, or data, and analyze it.

Web Mining and Data Mining

There is some confusion over the difference between web mining and data mining. Web mining is different from both data mining and text mining. Data mining, also known as Knowledge Discovery in Databases (KDD), is the discovery of useful patterns of knowledge from data sources (Liu, 2007). The fields overlap: many data mining techniques are used in web mining, and because much of the web is text, web mining is related to text mining as well. However, web mining differs because web data is mainly semi-structured or unstructured, while the data used in traditional data mining is mostly structured.
According to Liu, “Web mining aims to discover useful information or knowledge from the Web hyperlink structure, page content, and usage data. Although Web mining uses many data mining techniques, as mentioned above, it is not purely an application of traditional data mining due to the heterogeneity and semi-structured or unstructured nature of the Web data” (2007, p. 6-7).

Both data mining and web mining usually consist of a three-step process:

1. Pre-processing: Cleaning up the raw data to make it easier to mine for knowledge.
2. Mining process: Feeding the data to data mining algorithms to produce patterns or knowledge. The quality of an algorithm can be measured both by how effective it is and by how efficient it is.
3. Post-processing/analysis: Identifying useful patterns (if any).

Although the three steps are seen in both types of mining, they can differ considerably in application.

There are three types of web mining: web usage mining, web structure mining, and web content mining. Web usage mining focuses on the discovery of user access patterns from client-server transactions. Web structure mining gains useful knowledge from the structure of websites and hyperlinks. Lastly, web content mining extracts useful information from a web page’s contents. The following chart from Kosala and Blockeel (2000, p. 5) looks at the differences among the three.

[Chart from Kosala and Blockeel (2000, p. 5): image not available]

Web Content Mining

Web content mining can be defined as the process of extracting useful information from the contents of web documents. According to Srivastava (n.d.), “Content data corresponds to the collection of facts a Web page was designed to convey to the users. It may consist of text, images, audio, video, or structured records such as lists and tables” (slide 32). This field also draws on technologies such as Information Retrieval (IR) and Natural Language Processing (NLP).
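To make the idea of content mining more concrete, the following is a minimal sketch in Python of the three-step process applied to the text of a single page. It is only an illustration under stated assumptions, not a method taken from any of the cited sources: the sample page, the tiny stop-word list, and the names TextExtractor, term_frequencies, and top_terms are all hypothetical, and counting word frequencies is one of the simplest Information Retrieval techniques rather than a full content mining system.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "on"}


def term_frequencies(html):
    """Pre-process an HTML string, then count the remaining terms."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return Counter(t for t in tokens if t not in STOP_WORDS)


def top_terms(html, n=3):
    """Post-process: keep only the n most frequent terms."""
    return term_frequencies(html).most_common(n)


# Made-up sample page used only for this sketch.
SAMPLE_PAGE = """
<html><head><title>Web Mining</title>
<style>body { color: black; }</style></head>
<body><h1>Web mining</h1>
<p>Web mining is the discovery of useful information on the Web.</p>
<script>var mining = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    print(top_terms(SAMPLE_PAGE))
```

Stripping the markup and stop words corresponds to pre-processing, counting the terms to the mining step, and keeping only the most frequent terms to post-processing. A real system would likely add stemming, weighting, and other NLP steps that this sketch leaves out.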
People who would be interested in content mining include researchers, those involved in e-commerce, government officials, and many more. Web content mining has many applications and implications. It can help categorize web documents and find similar web pages across different servers. In e-commerce, it can help companies better understand the needs of their customers, which could lead to targeted marketing and more personalized service. The government uses web content mining to track and classify possible criminal and terrorist threats. One of the larger implications of web content mining is the invasion of privacy that can come with it, which is discussed later.

Web Structure Mining

Web structure mining is the process of discovering structure information from the web. It can be performed at the document level (intra-page) or at the hyperlink level (inter-page); the latter is called hyperlink analysis. “The structure of a typical web graph consists of web pages as nodes, and hyperlinks as edges connecting between two relevant pages.” One of the key ideas is ranking a web page according to the rank of the web pages pointing at it (Srivastava).

There are also many applications for web structure mining. It can be useful in gathering information on the quality of a web page based on its authority and ranking. It helps in studying interesting web structures and graph patterns such as co-citation or social choice. Other applications include web page classification, finding related pages, and detection of duplicate pages (Srivastava).

Web Usage Mining

Srivastava defines a web as a collection of inter-related files on one or more web servers. Web usage mining is “the discovery of meaningful patterns from data generated by client-server transactions on one or more web locations” (slide 64).
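As an illustration of this definition, the sketch below mines a few server log entries for a simple usage pattern: how often visitors move from one page to another. It is a rough example under several assumptions that do not come from the cited sources. The log is assumed to be in the Common Log Format and in time order, each host is treated as one visitor (real systems usually rely on sessions or cookies instead), and the sample entries and the names parse_log and page_transitions are made up for the example.

```python
from collections import Counter, defaultdict
import re

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Made-up sample entries; hosts and paths are illustrative only.
SAMPLE_LOG = """\
10.0.0.1 - - [19/Feb/2012:10:00:01 -0500] "GET /index.html HTTP/1.1" 200 1043
10.0.0.1 - - [19/Feb/2012:10:00:09 -0500] "GET /catalog.html HTTP/1.1" 200 2310
10.0.0.2 - - [19/Feb/2012:10:01:15 -0500] "GET /index.html HTTP/1.1" 200 1043
10.0.0.1 - - [19/Feb/2012:10:01:40 -0500] "GET /checkout.html HTTP/1.1" 200 870
10.0.0.2 - - [19/Feb/2012:10:02:02 -0500] "GET /catalog.html HTTP/1.1" 200 2310
10.0.0.3 - - [19/Feb/2012:10:03:30 -0500] "GET /missing.html HTTP/1.1" 404 210
"""


def parse_log(text):
    """Pre-processing: keep successful requests as (host, path) pairs."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.match(line)
        if match and match.group("status").startswith("2"):
            records.append((match.group("host"), match.group("path")))
    return records


def page_transitions(records):
    """Mining: count how often one page is followed by another, per host."""
    paths_by_host = defaultdict(list)
    for host, path in records:
        paths_by_host[host].append(path)
    transitions = Counter()
    for paths in paths_by_host.values():
        # Pair each page with the next page requested by the same host.
        transitions.update(zip(paths, paths[1:]))
    return transitions


if __name__ == "__main__":
    records = parse_log(SAMPLE_LOG)
    for (source, target), count in page_transitions(records).most_common():
        print(f"{source} -> {target}: {count}")
```

Dropping the failed request is a small instance of pre-processing, and the transition counts are the kind of access pattern that a site owner could examine when deciding how to structure a site.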
The data that is studied can come from a number of sources. Servers can automatically generate data such as access logs, agent logs, and client-side cookies. Other sources are user profiles or metadata such as page attributes, content attributes, and usage data.

Many of the implications of web usage mining are in the field of e-commerce. It allows a company to determine the lifetime value of a client. It also helps in designing cross-marketing strategies and evaluating promotional campaigns. It allows the company to target ads and coupons at user groups based on their access patterns. It can also predict user behavior patterns and present more dynamic information to users based on their interests and profiles.

Other implications include a more effective and efficient web presence. Usage mining allows a site’s creator to determine the best way to structure the site and to pre-fetch files that are likely to be accessed. Intra-organization applications (within an organization) include enhancing work group management and communication. Usage mining can also evaluate the effectiveness of an intranet and identify possible structural needs.

Techniques

There are a number of different techniques for web mining, but the most popular are:

1) Classification/Supervised learning: This is the most frequent. Classes or categories are pre-defined, and documents are assigned to one or more existing categories.
2) Clustering/Unsupervised learning: There are no pre-defined categories; algorithms put the data into groups, or clusters.
3) Association rules: Finds sets of data items that occur frequently together.
4) Sequential patterns: Finds sets of data items that occur frequently together in some sequence.

Future

The applications and implications of web mining are just beginning to be explored. Much like the internet itself, it is a very dynamic field. Some look to web mining as an apparatus for behavior experiments, as well as a way to study the evolution of the web and its patterns.
Mining of emails for information is another avenue for targeted marketing and for social networking.

Personalization of the web is a big topic in the field. Mulvenna, Anand, and Buchner (2000) state that “web mining provides the tools to analyze web log data in a user-centric manner” (p. 124). They discuss adaptive internet sites that automatically improve their organization and presentation as they learn from their visitors’ patterns. Web personalization can be defined as “an action that adapts the information or services provided by a web site to the needs of a particular user or set of users taking advantage of the knowledge gained from the user’s navigational behaviors and individual interests, in combination with the content and structure of the website” (Eirinaki and Vazirgiannis, 2003, p. 1-2). There are many ways that web mining can lead to value-added services for users.

Zaiane (2001) also looked at how web mining could improve web-based distance learning. Much of the technology used to gather information for e-commerce can be applied to the educational setting. At the time, little had been done to study learners’ patterns, and it seems this is an area that needs further research. However, web mining techniques could be applied to better evaluate the learning process, and perhaps to personalize it as well.

Web mining also raises very serious privacy concerns. Most people do not know how much information is extracted from them and their internet usage. Web mining can also lead to a number of threats, including identity theft, defamation, industrial espionage, electronic ransom, vandalism, and market manipulation. Steps needed to address these concerns include more awareness and education, more regulations and laws, and better technology for security and auditing (Srivastava).

Web mining has the potential to be a great asset for both consumers and corporations.
It can also help open new fields of research and create new knowledge out of old knowledge. But as with all new technology, there are serious ethical considerations in the use of these techniques that should be further discussed and evaluated.

References

Cooley, R., & Mobasher, B. (1997). Web mining: Information and pattern discovery in the world wide web. Retrieved from http://maya.cs.depaul.edu/classes/ect584/papers/cms-tai.pdf

Eirinaki, M., & Vazirgiannis, M. (2003). Web mining for web personalization. ACM Transactions on Internet Technology, 3(1), 1–27. http://www.inf.ufsc.br/~lumar/datamining/p1-eirinaki.pdf

Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. Retrieved from http://www.umiacs.umd.edu/~joseph/classes/enee752/Fall09/survey-2000.pdf

Liu, B. (2005). Web content mining [Tutorial]. The 14th International World Wide Web Conference. Retrieved from http://www.cs.uic.edu/~liub/Web-Content-Mining-2.pdf

Liu, B. (2007). Web data mining: Exploring hyperlinks, contents, and usage data. New York: Springer Berlin Heidelberg. Retrieved from http://www.mendeley.com/research/datamining-web/

Mulvenna, M., Anand, S., & Buchner, A. (2000). Personalization on the net using web mining. Communications of the ACM, 43(8), 122-125. http://0search.ebscohost.com.www.consuls.org/login.aspx?direct=true&db=lxh&AN=ISTA3501938&site=ehost-live

Srivastava, J. (n.d.). Web mining: Accomplishments and future directions [PowerPoint]. Retrieved from http://ignatius.atw.hu/mining.pdf

Zaiane, O. (2001). Web usage mining for a better web-based learning environment. Retrieved from http://scholar.googleusercontent.com/scholar?q=cache:iJA2jZHK8sEJ:scholar.google.com/+web+mining&hl=en&as_sdt=0,7