Intelligent Web Topics Search Using Early Detection and Data Analysis

Ching-Cheng Lee, Yixin Yang
Mathematics and Computer Science, California State University at Hayward
Hayward, California 94542

Abstract

Topic-specific search engines that offer users relevant topics as search results have recently been developed. However, these topic-specific search engines require intensive human effort to build and maintain. In addition, they visit many irrelevant pages. In our project, we propose a new approach for Web topics search. First, we perform early detection of candidate topics as words are extracted. Second, we analyze occurrence information such as appearance times and places for candidate topics. By these two techniques, we can reduce candidate topics' crawling times and computing cost. Analysis of the results and comparisons with related research will be made to demonstrate the effectiveness of our approach.

1. Background

Search engines are widely used for information access on the World Wide Web. Today's conventional search engines are designed to crawl and index the Web and produce giant hyperlink databases. These search engines may return hundreds or more links for a user's query, provided that the right keyword is used in the query. Despite this searching power, these search engines lack the capability of finding the relevant Web sites for a given specific topic. It is therefore important to add capabilities to search engines that provide topic-related search.

Various topic-related search systems have recently been proposed. One approach is the topic-specific search engine MathSearch [12]. This search engine offers higher-quality search results, but it requires intensive human effort. J. Cho, H. Garcia-Molina, and L. Page proposed efficient crawling through a Uniform Resource Locator (URL) ordering scheme [7]. Their scheme guides the web crawler to rank web pages on both content and link measures. The disadvantages of their system are that it requires scanning the entire text of each web page and that it does not account for topic relevance. Another approach is the Focused Crawler [2], which utilizes both web link structure information and content similarity (based on document classification). The main drawback of this system is that it visits too many irrelevant pages and requires additional seed pages in order to add new topics. Most recently, J. Yi and N. Sundaresan [6] proposed an effective web-mining algorithm called topic expansion to discover relevant topics during the search. This scheme avoids visiting unnecessary Web pages and scanning the whole text of each web page, and it does not need intensive human effort. The system mines relevant topics as follows. First, it collects a large number of Web pages. Then it extracts words from the text contained inside HTML document tags. From the extracted words, the system selects words that are potentially relevant to the target topic. Finally, the system uses a formula and a relationship-based architecture capturing the relations between words to refine and return the relevant topics. Although the topic expansion scheme is far better than other schemes, it has some drawbacks: it still needs much human involvement and multiple iterations of Web site crawling. The number of Web crawls depends on the number of given candidate topics. Additionally, if a new incoming word does not appear in the relationship-based architecture, it will not easily be detected even though it may be highly relevant to the target topic.
In this project, we propose a new scheme that uses early detection and data analysis techniques for detecting and analyzing candidate topics. With these techniques, we can reduce candidate topics' crawling times and computing cost, making the system more efficient.

2. The Research

2.1. System Architecture

Our system consists of the following six components: a web crawler, a page parser, a stop word filter, a candidate topic selector, a candidate topic filter, and a relevant term database, as illustrated in Figure 1.

[Figure 1. System Architecture: web pages flow from the web crawler through the HTML parser, stop word filter, candidate topic selector, and candidate topic filter into the relevant term database.]

The core parts of the system are the candidate topic selector and the candidate topic filter. The candidate topic selector performs early detection as words are extracted. The candidate topic filter analyzes the occurrence times and places of candidate topics to eliminate false candidate topics and to find brand-new Internet words. By applying the early detection and data analysis methods in these two components, our system can reduce candidate topics' crawling times and computing cost, thus making the system more efficient. In the following, we describe each of the components in more detail.

2.2. System Components

The system is implemented as follows:

Web Crawler: The Web Crawler crawls the World Wide Web and retrieves and processes a large number of web pages. Our web crawler is fast, memory-limited, and highly flexible. In addition, it has two features not found in most web crawlers: (1) It detects file types; if a hyperlink references a large ZIP file, application file, or audio or video file, the crawler will detect this and choose to skip these large files. (2) It can avoid visiting irrelevant web pages; if the crawler detects that a web page is not relevant to the target topic, it avoids retrieving most of the URLs in that page. Therefore, in our implementation, the web crawler is more efficient and robust in that it will not choke on unreadable file formats. To determine file types, our system uses methods in the java.net package of the Java 2 Platform SE 1.3 to determine the type of the remote object; if the remote object is a ZIP file, an application file, an image file, etc., our web crawler skips these unreadable or too-large files. In this way, we obtain well-formatted HTML pages. The web crawler then passes each web page to the page parser (other layers such as the stop words filter and the candidate topic selector are invoked from then on). The candidate topic selector performs part of the early detection: if the system detects that a web page is obviously irrelevant to the target topic, our web crawler avoids retrieving most of the URLs in that web page, making the system more efficient.

HTML Parser: The HTML Parser parses the downloaded web pages and extracts metadata as text. In this project, we choose anchor text as the metadata information, because anchor text occurs most frequently in an HTML document and is the most reliable of the four kinds of hyperlink metadata [6]. The page parser is implemented as follows: when it receives a well-formed web document from the crawler, it parses the document for elements whose names correspond to the hyperlink elements (A for anchor tags, IMG for image tags, etc.) and extracts the attribute values.
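As an illustration of how the crawler's file-type check and the parser's anchor-text extraction can fit together, the sketch below uses URLConnection.getContentType() from the java.net package to skip non-HTML resources and a regular expression to pull anchor text out of a fetched page. This is a minimal sketch under our own assumptions, not the authors' implementation: the class and method names (CrawlerSketch, fetchAnchorTexts) and the regex-based extraction are ours, and a production crawler would use a real HTML parser rather than a regular expression.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlerSketch {

    // Very rough anchor-tag pattern; stands in for a real HTML parser.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the anchor texts of a page, or an empty list if the page is skipped. */
    public static List<String> fetchAnchorTexts(String url) throws Exception {
        URLConnection conn = new URL(url).openConnection();
        String type = conn.getContentType();   // e.g. "text/html; charset=UTF-8"

        // File-type check: skip ZIP archives, applications, images, audio, video, etc.
        if (type == null || !type.startsWith("text/html")) {
            return new ArrayList<String>();    // not a well-formatted HTML page
        }

        // Read the page body.
        StringBuilder html = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        } finally {
            in.close();
        }

        // Extract anchor text metadata for the downstream filters.
        List<String> anchors = new ArrayList<String>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            anchors.add(m.group(1).replaceAll("<[^>]*>", " ").trim());
        }
        return anchors;
    }
}
```

The content-type test is the key efficiency point: the crawler decides whether to download the body at all before reading it, which is what lets the system skip large ZIP, audio, and video files cheaply.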
Stop Words Filter: The Stop Words Filter gets the words from the HTML parser and filters out all the common words and words that are obviously irrelevant to the target topic. Common words are words that are too popular or too generic, such as "welcome", "best", "a", "that", "is", "in", etc. The stop words filter can be target-specific and needs minimal human interference in this topic-mining scheme. All stop words are stored in a database table for easy access and easy update; a user may update the stop words database in the middle of a mining iteration, and a set of common words can be reused for different target topics. The stop words filter checks every term: if a term is in the stop words database, the term is disqualified as a candidate topic. After this step, words are passed on to the candidate topic selector.

Candidate Topic Selector: The Candidate Topic Selector selects candidate topics with respect to the specified target topic. It performs early detection of candidate topics; every term that co-occurs with the target topic can be considered a candidate topic. The candidate topic selector is implemented as follows. It receives words from the stop words filter. For every word, it computes and updates the number of times the word occurs and the number of times it co-occurs with the target topic, and a record for the word is stored in the database. Every word (excluding stop words) is kept in the database as a record; these records are used for the relevance computations in subsequent components. The same candidate selection is applied to all anchor text metadata in a page, and the system recursively performs the same procedure for all the crawled URLs. When the user chooses to stop selecting candidate topics, or the number of URLs crawled reaches a user-specified count, the candidate topic selector stops. Only words that frequently co-occur with the target topic are finally deemed candidate topics. The user can also define a smallest co-occurrence count as a criterion for candidate topics: if a word's co-occurrence count is less than this number, the word is not considered a candidate topic.

Candidate Topic Filter: The Candidate Topic Filter is activated at the end of mining. It retrieves the candidate topics from the database and then uses a simplified yet powerful formula to compute the relevance of each candidate topic and decide whether to keep it as a relevant word. By analyzing the candidate topics' occurrence information, relevant words can be detected. After this step, the detected relevant terms are stored in the relevant term database described below for future use as a knowledge base for search engines. In our system, for each candidate topic, we use a simple but adequate filtering formula:

r = (co / to) × (co / MAXco); a candidate topic is kept as a relevant term only if r ≥ c.

In this formula, co is the number of times a candidate topic co-occurs with the target topic; to is the total number of occurrences of the candidate topic; MAXco is the maximum value of co among all candidate topics; r is the filtering metric; and c is the user-defined threshold.

Relevant Term Database: The Relevant Term Database contains the relevant topic data for the user-specified target topic. A database is created with three tables: candidate_topics, relevant_terms, and stop_words.
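The sketch below shows one way the selector's counters and the filter's formula could be realized. It is a hedged illustration rather than the paper's code: the class and method names (TopicStats, observe, filter) and the in-memory maps standing in for the database tables are our assumptions, it is written against modern Java rather than the JDK 1.4.1 mentioned in Section 3, and only the formula itself comes from the text above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopicStats {
    private final String targetTopic;
    private final Map<String, Integer> totalOccurrences = new HashMap<>(); // "to" per word
    private final Map<String, Integer> coOccurrences = new HashMap<>();    // "co" per word

    public TopicStats(String targetTopic) {
        this.targetTopic = targetTopic.toLowerCase();
    }

    /** Candidate topic selector: update counts for the words of one anchor text. */
    public void observe(List<String> anchorWords) {
        boolean hasTarget = false;
        for (String w : anchorWords) {
            if (w.equalsIgnoreCase(targetTopic)) { hasTarget = true; break; }
        }
        for (String w : anchorWords) {
            String word = w.toLowerCase();
            if (word.equals(targetTopic)) continue;
            totalOccurrences.merge(word, 1, Integer::sum);
            if (hasTarget) {
                // The word co-occurs with the target topic in this anchor text.
                coOccurrences.merge(word, 1, Integer::sum);
            }
        }
    }

    /** Candidate topic filter: keep a word only if r = (co/to)*(co/MAXco) >= c. */
    public List<String> filter(double c) {
        int maxCo = 0;
        for (int co : coOccurrences.values()) maxCo = Math.max(maxCo, co);
        List<String> relevant = new ArrayList<>();
        if (maxCo == 0) return relevant;
        for (Map.Entry<String, Integer> e : coOccurrences.entrySet()) {
            double co = e.getValue();
            double to = totalOccurrences.get(e.getKey());
            double r = (co / to) * (co / maxCo);
            if (r >= c) relevant.add(e.getKey());
        }
        return relevant;
    }
}
```

With c = 0.055 as in the experiments of Section 3, a word survives the filter only if it both co-occurs with the target often relative to its own total occurrences (the co/to factor) and is strong relative to the best candidate (the co/MAXco factor).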
The stop_words table stores words that are too popular or obviously irrelevant to the target topic; the candidate_topics table stores words that come from the extracted metadata and are potentially related to the target topic; the relevant_terms table holds the final results.

3. Experimentation

3.1. The Experiments

Our system runs as an MS-DOS Prompt application on any Windows system such as Windows XP, Windows ME, etc. JDK 1.4.1 and a Microsoft Access database are used as our development tools. To connect to the database, an ODBC data source must be registered with the system through the ODBC data sources panel. When the program starts, a Java GUI window pops up, as in Figure 2. The user enters a seed URL and the target topic, and the program starts to run. Any URL may be used as the seed URL, but for best results the user can apply some domain knowledge when choosing it.

[Figure 2. The Java GUI Window]

Besides the simplified filtering algorithm, our system has the following features:

- It detects file types and handles them accordingly. For example, if a hyperlink references a ZIP file, an application file, or an audio or video file, the system detects this and chooses to skip the large file.
- It detects the irrelevance of some pages and thus avoids visiting irrelevant web pages. If our system detects that a web page is irrelevant to the target topic, that page's chance of being visited becomes much lower than that of seemingly relevant pages, which helps our system stay in the right mining direction.

Most URL and web page errors, such as a URL referring to an empty or removed page, or a server or client being down, are handled gracefully by the system. This feature makes our system robust.

We use the following criteria to evaluate the algorithm: (1) The quality of the algorithm should be evaluated by considering the number of false inclusions (irrelevant terms falsely considered relevant); false inclusions should be minimized. (2) The quality of the algorithm should be evaluated by considering the number of relevant terms found versus the number of web pages crawled, while at the same time taking the number of false inclusions into account.

3.2. Results and Comparisons

Table 1 summarizes the mining results for the target topic "XML". By applying our scheme to the hyperlink metadata of the crawled web pages, our system produced sets of relevant topics of good quality. In Table 1, "Actual relevant topics" means the relevant topics identified among the candidate topics by people (not by our application); this field is used to check our results. "Relevant topics by c=0.055" means the number of relevant topics detected by our application when run with c=0.055. False inclusion refers to irrelevant topic terms that our application included in the set of relevant topics. False exclusion refers to relevant topics that the algorithm did not include in the set of relevant topics. As seen from these results, our mining scheme showed low false inclusion; for instance, our system produced 0 false inclusions among the 51 relevant topics in the 4th iteration. Each run of the application followed by an update of the stop word database is called an iteration.
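To make the two evaluation criteria concrete, the short sketch below counts false inclusions and false exclusions by comparing the algorithm's output against the human-judged set. The class and method names are hypothetical; the definitions of the two measures come from the text above.

```java
import java.util.HashSet;
import java.util.Set;

public class Evaluation {
    /** False inclusions: terms reported relevant that the human judges did not accept. */
    public static int falseInclusions(Set<String> reported, Set<String> actual) {
        Set<String> fi = new HashSet<>(reported);
        fi.removeAll(actual);
        return fi.size();
    }

    /** False exclusions: human-judged relevant terms the algorithm failed to report. */
    public static int falseExclusions(Set<String> reported, Set<String> actual) {
        Set<String> fe = new HashSet<>(actual);
        fe.removeAll(reported);
        return fe.size();
    }
}
```

For example, in the 4th iteration the application reported 51 relevant topics out of 57 actual relevant topics with 0 false inclusions, so all 51 reported topics were correct and 57 - 51 = 6 relevant topics were falsely excluded.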
In terms of the number of relevant terms versus the number of web pages crawled, our system exhibits very good results for relevant topics and is more efficient than the topic expansion system. In the experiments reported for that system, 4 iterations with a total of 34,000 web pages crawled yielded 49 relevant topics out of 54 actual relevant topics; our system crawled only 17,000 web pages to obtain 51 relevant topics out of 57 actual relevant topics in 4 iterations.

                              1st Iter.  2nd Iter.  3rd Iter.  4th Iter.
# of pages crawled               …          …          …         7000
Candidate topics                 …          …         141        167
Actual relevant topics          25         36          45         57
Relevant topics by c=0.055       …          …          46         51
False inclusion (c=0.055)        …          …          …          0
False exclusion (c=0.055)        …          …          …          6

Table 1. Experiment for target topic "XML"

Another improvement of our system is that users do not have to create a relationship-based architecture. Most similar systems need a relationship-based hierarchy; in the topic expansion system, users have to expand the topic hierarchy after each iteration to help the next iteration get better results. In our system, after each iteration, users may choose to add more (possibly target-specific) stop words to the stop words table, but they do not need to understand the relationships between words very well. One thing we have to realize is that in reality there is no ultimate way of eliminating false exclusion (relevant terms in certain crawled pages being falsely considered irrelevant by the algorithm). Other research has claimed to minimize false exclusion from candidate topics to final relevant topics; what they did is reduce the candidate topic set. The downside of their approach is that many potentially relevant terms are excluded from the candidate topic set, so the completeness of the candidate set is greatly compromised. In our project, on the other hand, false exclusion is controlled empirically by the filtering function while the completeness of the candidate set is maintained.

4. Conclusions

Contrary to commercial search engines, which output as many results as possible, including irrelevant garbage pages, the philosophy of topic relevance search is to ensure the correctness of the relevance found while striving for completeness. Since a huge number of web pages must be crawled when mining for relevance, efficiency becomes a very important issue. In this project, we developed a new approach for relevant topic search and demonstrated that our system can efficiently mine topic-specific relevant words with a much smaller number of crawls. We used an early detection method to select candidate topics from the text inside HTML documents' anchor text metadata. Our system analyzed the occurrence times and locations of candidate topics in the filtering step so that false candidate topics could be successfully eliminated and words relevant to the user-specified target could be found.

5. References

[1] B. Huberman, P. Pirolli, J. Pitkow, and R. Lukose. Strong regularities in World Wide Web surfing. Science, 280:95-97, Apr. 1998.
[2] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific web resource discovery. In Proc. of the 8th International World Wide Web Conference, Toronto, Canada, May 1999.
[3] A. Sugiura and O. Etzioni. Query routing for web search engines: architecture and experiments. In Proc. of the 9th International World Wide Web Conference, Amsterdam, The Netherlands, May 2000.
[4] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Building domain-specific search engines with machine learning techniques. In AAAI Spring Symposium, 1999.
[5] S. Lawrence and L. Giles. Accessibility and distribution of information on the web. Nature, 400:107-109, July 1999.
[6] J. Yi and N. Sundaresan. Metadata based web mining for relevance. In Proc. of the 2000 International Database Engineering and Applications Symposium (IDEAS'00), Yokohama, Japan, September 18-20, 2000.
[7] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proc. of the 7th International World Wide Web Conference, Brisbane, Australia, Apr. 1998.
[8] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. of the 21st International ACM SIGIR Conference, Melbourne, Australia, 1998.
[9] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
[10] H. Chen, Y. Chung, M. Ramsey, and C. Yang. A smart itsy bitsy spider for the web. Journal of the American Society for Information Science, 49(7):604-618, 1998.
[11] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proc. of the 7th International World Wide Web Conference, Brisbane, Australia, Apr. 1998.
[12] MathSearch. http://www.maths.usyd.edu.au/MathSearch.html