IS733 Data Warehousing and Mining
Project Summary
Applying Web Content Mining and Web Structure Mining
to the Web Site of the Central Bank of Colombia
Juan M. Cubillos
1. Introduction
I am interested in applying Web Mining techniques to a specific Web site: that of my
employer, the Central Bank of Colombia (http://www.banrep.gov.co). This Web site is an
important source of economic and financial information in Colombia.
Srivastava et al. [SIVR, 2000] consider four classes of data available on the Web: (1) Web
content, (2) Web structure, (3) Web usage, and (4) Web user profiles. I would like to work in
the areas of Web Content Mining and Web Structure Mining.
2. Expected Contributions

An in-depth analysis of the content and structure of the Central Bank of Colombia's Web
site, using Web Content Mining and Web Structure Mining, in order to suggest adjustments
that improve the design, organization, and navigation of the site.

The application of different Web Mining techniques, and of combinations of Web Content and
Web Structure techniques, in order to find interesting patterns.

The development of some Web Mining tools in a programming language such as Perl or
Java.
3. Related Work
The following topics are the most relevant to my work:

Web Content Mining and Text Mining
Web Structure Mining
Extracting Relational Data from HTML Repositories
Mining Structures for Semantics
Learning to Extract Information from Large Domain-Specific Web Sites
4. Approach/Methodology
The following is my proposed methodology:

Learn the application domain
Retrieve the Web site by using a crawler or spider
Create a target dataset
Preprocess the data
Choose functions, algorithms, and mining techniques (I plan to develop my own code)
Mine for patterns of interest
Evaluate the patterns
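
As a rough illustration of the retrieval and preprocessing steps above, the following sketch extracts internal links and text tokens from one already-downloaded page. It is only a sketch under stated assumptions: it uses Python and its standard library for brevity (the proposal names Perl or Java as candidate languages), and the names LinkAndTextExtractor, preprocess_page, and the sample HTML are hypothetical and not taken from the actual Web site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTextExtractor(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside scripts and styles.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def preprocess_page(url, html):
    """Return (internal_links, tokens) for one downloaded page."""
    parser = LinkAndTextExtractor(url)
    parser.feed(html)
    site = urlparse(url).netloc
    # Keep only links that stay on the same host (input to structure mining).
    internal = [link for link in parser.links if urlparse(link).netloc == site]
    # Lowercased whitespace tokens (input to content mining).
    tokens = " ".join(parser.text_parts).lower().split()
    return internal, tokens


# Hypothetical page content, for illustration only.
sample_html = """
<html><head><style>p {color: red}</style></head>
<body><p>Monetary policy report</p>
<a href="/reports/index.html">Reports</a>
<a href="http://example.org/">External</a>
</body></html>
"""
links, tokens = preprocess_page("http://www.banrep.gov.co/", sample_html)
print(links)
print(tokens)
```

In a real crawl, each internal link would be queued for download in turn, and the per-page links and tokens would form the target dataset for the later mining steps.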
5. Expected Outcome

I will present the most relevant patterns discovered in my research. The main issues that
I will address are as follows:

Text retrieval measures (precision and recall)
Document clustering analysis
Mining Web page layout structure
Mining Web link structures to identify authoritative Web pages

Based on the results, I will suggest modifications to the current Web site's content and
structure.

I will present the main characteristics of my own Web Mining tool.
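
Two of the measures listed in this section can be sketched briefly. The code below is an illustration under stated assumptions, not the tool itself: it is written in Python rather than Perl or Java, the proposal does not name an algorithm for finding authoritative pages, so Kleinberg's HITS (hubs and authorities) is assumed here as one well-known option, and the function names and the small link graph are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = matched / retrieved; recall = matched / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    matched = len(retrieved & relevant)
    precision = matched / len(retrieved) if retrieved else 0.0
    recall = matched / len(relevant) if relevant else 0.0
    return precision, recall


def hits_scores(graph, iterations=50):
    """Iterative HITS. graph maps each page to the list of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical retrieval result and relevance judgments.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
print(p, r)

# Hypothetical link graph of four pages.
graph = {
    "home": ["reports", "stats"],
    "news": ["reports"],
    "stats": ["reports"],
    "reports": [],
}
hub, auth = hits_scores(graph)
top_authority = max(auth, key=auth.get)
print(top_authority)
```

In this toy graph the page with the most in-links from good hubs receives the highest authority score; on the real site, the graph would be built from the internal links collected by the crawler.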
6. References
[HAN, 2006] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Second Edition. 2006.
Morgan Kaufmann Publishers.
[HAND, 2001] D. Hand, H. Mannila, P. Smyth. Principles of Data Mining. 2001. MIT Press.
[LARO, 2006] D. Larose. Data Mining Methods and Models. 2006. Wiley-Interscience.
[LIU, 2002] B. Liu, K. Zhao, L. Yi. Visualizing Web Site Comparisons. 2002. ACM. Proceedings
of the 11th International Conference on World Wide Web, 693-703.
[LIU, 2006] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 2006.
Springer.
[LOTO, 2002] T. Loton. Web Content Mining with Java: Techniques for Exploiting the World
Wide Web. 2002. John Wiley & Sons Ltd.
[SIVR, 2000] J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan. Web Usage Mining:
Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Volume 1,
Issue 2, 12-23 (2000).
[YUEE, 2004] J. Yuee, L. Lakshmanan, R. Zamar. Extracting Relational Data from HTML
Repositories. SIGKDD Explorations, Volume 6, Issue 2, 5-13 (2004).
[ZHONG, 2003] N. Zhong, J. Liu, Y. Yao. Web Intelligence. 2003. Springer.