IS33 Data Warehousing and Mining
Project Summary

Applying Web Content Mining and Web Structure Mining to the Web Site of the Central Bank of Colombia

Juan M. Cubillos

1. Introduction

I am interested in applying Web Mining techniques to a specific Web site: that of my employer, the Central Bank of Colombia (http://www.banrep.gov.co). This Web site is an important source of economic and financial information in Colombia. Srivastava et al. [SIVR, 2000] consider four classes of data available on the Web: (1) Web content, (2) Web structure, (3) Web usage, and (4) Web user profile. I would like to work in the areas of Web Content Mining and Web Structure Mining.

2. Expected Contributions

- An in-depth analysis of the content and structure of the Central Bank of Colombia’s Web site, in order to suggest adjustments that improve the site’s design, organization, and navigation.
- The application of different Web Mining techniques, and of combinations of Web Content and Web Structure techniques, in order to find interesting patterns.
- The development of Web Mining tools in a programming language such as Perl or Java.

3. Related Work

The following topics are the most relevant to my work:

- Web Content Mining and Text Mining
- Web Structure Mining
- Extracting Relational Data from HTML Repositories
- Mining Structures for Semantics
- Learning to Extract Information from Large Domain-specific Websites

4. Approach/Methodology

My proposed methodology is as follows:

- Learn the application domain.
- Retrieve the Web site using a crawler or spider.
- Create a target dataset.
- Preprocess the data.
- Choose functions, algorithms, and mining techniques (I plan to develop my own code).
- Mine for patterns of interest.
- Evaluate the patterns.

5. Expected Outcome

I will present the most relevant patterns discovered in my research.
The most relevant issues that I will address are as follows:

- Text retrieval measures (precision and recall)
- Document clustering analysis
- Mining Web page layout structure
- Mining the Web’s link structure to identify authoritative Web pages

Based on the results, I will suggest modifications to the current Web site’s content and structure. I will also describe the main characteristics of the Web Mining tool that I develop.

6. References

[HAN, 2006] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers. 2006.
[HAND, 2001] D. Hand, H. Mannila, P. Smyth. Principles of Data Mining. MIT Press. 2001.
[LARO, 2006] D. Larose. Data Mining Methods and Models. Wiley-Interscience. 2006.
[LIU, 2002] B. Liu, K. Zhao, L. Yi. Visualizing Web Site Comparisons. 11th International Conference on World Wide Web. ACM. 693-703. 2002.
[LIU, 2006] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer. 2006.
[LOTO, 2002] T. Loton. Web Content Mining with Java. Techniques for Exploring the World Wide Web. John Wiley & Sons Ltd. 2002.
[SIVR, 2000] J. Srivastava, R. Cooley, M. Deshpande, P-N. Tan. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, 1, 12-23. 2000.
[YUEE, 2004] J. Yuee, L. Lakshmanan, R. Zamar. Extracting Relational Data from HTML Repositories. SIGKDD Explorations, Volume 6, Issue 2, 5-13. 2004.
[ZHONG, 2003] N. Zhong, J. Liu, Y. Yao. Web Intelligence. Springer. 2003.