The Technion Electrical Engineering Software Lab

Design Document for Project: Content Based Spam Web Pages Detector

Last Revision Date: 10.1.2006

Team members:
Avishay Livne avishay.livne@gmail.com 066494576
Itzik Ben Basat sizik1@t2.technion.ac.il 033950734

Instructor Name: Maxim Gurevich

Index
0 Abstract
1 Introduction
2 Design Goals and Requirements
3 Design Decisions, Principles and Considerations
4 System Architecture
5 Component Design
6 Data Design
7 Results
8 Summary
9 Future Work
10 Definitions, Acronyms, and Abbreviations
11 User Interface
12 Reference Material

Abstract

This document is the design document for the Content Based Spam Web Pages Detector project. The project was done by Avishay Livne and Itzik Ben Basat in the software lab. This is an educational project with no commercial intentions.
The project implements an application that classifies HTML web pages as spam or legitimate pages. The classification is based purely on the content of the HTML source of the specific page; the classifier doesn't need any external data, such as links between different pages. Briefly, the classification process consists of the following steps:
1. Parsing the to-be-classified pages into a format the application can handle, evaluating each of the page's attributes.
2. Constructing a decision tree, based on a manually tagged dataset of pages. Each page has a tag marking it as spam or legitimate.
3. Running each of the pages through the decision tree and deciding whether it is spam or not.
The theory behind this classification method is described in the paper "Detecting Spam Web Pages through Content Analysis" by Ntoulas et al., which served as the theoretical background for the project. After implementing the classifier we tested its performance on different datasets and compared the results to those in the paper. In order to investigate the differences in the results we produced histograms that describe the distribution of each of the attributes.
1 Introduction

Recently, spam has invaded the Web as it earlier invaded e-mail. A web spammer's goal is to artificially improve the ranking of a specific web page within a search engine. Numerous bogus web sites are created and clog the Web just as e-mail spam clogs inboxes. More and more resources are directed by search engines to fight spam, and many search engines struggle to provide high-quality search results in its presence. A recent study by Ntoulas et al. proposes an efficient technique to detect spam web pages. It analyzes certain features of a web page, such as the fraction of visible content and compressibility, and then employs machine learning in order to build a page classifier based on these features.
The rest of the document describes the goals of the project, the requirements from the application, major design decisions, the system architecture, a deeper explanation of the system's components and some data structures we used.

2 Design Goals and Requirements

This part describes the goals of the project and gives a detailed requirements list for the application. The purpose of the project is to implement a web spam classifier which, given a web page, will analyze its features and try to determine whether the page is spam or not. The efficiency of the classifier will be compared to the results of Ntoulas et al.
The main purposes of the project are:
1. Learning algorithms and methods to classify web pages as spam pages by analyzing only the pages themselves.
2. Implementing the algorithms and methods in order to create a classifier that will detect web spam pages.
3. In this project we shall not use methods that track links and construct graphs that model the web.
4. After implementing the classifier we shall test its performance on data sets of web pages. The pages are tagged manually, so we can get a good measure of the quality of our classifier.
Other goals are:
1. Getting familiar with a hot topic in the field of search engines.
2. Getting familiar with HTML parsing.
3. Getting familiar with machine learning and decision trees.
4. Having fun.

3 Design Decisions, Principles and Considerations

The classification of a web page will be done according to the following attributes:
1. Domain name (if the URL is given)
2. Number of words in the page
3. Number of words in the page's title
4. Average length of words
5. Amount of anchor text
6. Fraction of visible content
7. Compressibility
8. Fraction of the page drawn from globally popular words
9. Fraction of globally popular words
In order to train our classifier we shall use free data sets of manually tagged HTML pages that we found on the internet. We shall implement our parser using HTMLParser, a popular HTML parsing library (more information can be found at http://htmlparser.sourceforge.net/). The decision making process will be done according to the C4.5 algorithm. We shall use jaDTi, an open-source decision tree library implemented in Java, to build our decision tree.
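For reference, C4.5 grows the tree greedily: at each node it selects the attribute whose split yields the highest gain ratio. Below are the standard textbook formulas, with S the set of pages at a node, p_c the fraction of pages in S belonging to class c, and S_v the subset of S in which attribute A takes value v; this is general background on the algorithm, not project-specific notation:

    H(S) = -\sum_{c} p_c \log_2 p_c
    \mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)
    \mathrm{SplitInfo}(S, A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}
    \mathrm{GainRatio}(S, A) = \mathrm{Gain}(S, A) / \mathrm{SplitInfo}(S, A)

The entropy and test score thresholds mentioned in the Results section presumably control when this recursive splitting stops; jaDTi's exact variant may differ in its details.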
4 System Architecture

The architecture we designed is composed of two major components, each of which uses two major tools. The major components are the Trainer and the Classifier; both of these classes use the Parser and the DecisionTree.
The Trainer's input is a set of HTML pages manually tagged as spam or not-spam. It creates a new DecisionTree and trains it according to this dataset. After the whole training process is complete, the Trainer saves the DecisionTree in a file for future use by the Classifier.
The Classifier's input is a set of HTML pages. The goal of the Classifier is to tag each of the HTML pages correctly as a spam or not-spam page. In order to decide whether a page is spam or not, the Classifier uses the DecisionTree that was built by the Trainer.
Both the Trainer and the Classifier use the Parser to gather the desired statistics for every HTML page. The Parser's input is an HTML page and its output is a list of attributes and their values.
We considered combining the Trainer and the Classifier into one component, but decided it would be better to split them up to create a clearer encapsulation and a more organized distribution of responsibilities.
The class diagram of the project: [figure omitted]

5 Component Design

5.1 The Parser:
The Parser component is responsible for extracting the required statistics from each HTML page. In order to implement the Parser we use an external library, HTMLParser. The Parser iterates a few times over the HTML source of the page and calculates the following statistics for each page: number of words in the page, number of words in the title, average length of words, amount of anchor text, fraction of visible content, fraction of the page drawn from globally popular words and fraction of popular words. In addition, the Parser compresses the page in order to calculate its compressibility, as sketched below.

5.2 The DecisionTree:
The DecisionTree component is responsible for deciding whether a page is web spam or a legitimate web page. In order to build a new DecisionTree we supply it with a set of manually tagged pages; each page in the data set was tagged by a person as "spam", "normal" or "undecided". For the DecisionTree, each page is represented as a map of attributes and values, according to which the DecisionTree decides whether a webpage is spam or not. Once a DecisionTree is instantiated and built, it can be used for classifying untagged webpages. The DecisionTree's output is the result of the classification algorithm, that is, the value that represents how "spammy" the website is.

5.3 The Trainer:
The Trainer component is responsible for training the decision tree. This component uses the Parser to read a set of manually tagged HTML pages, instantiates a decision tree and trains it with this data set to create a well-trained decision tree.

5.4 The Classifier:
The Classifier component is responsible for classifying untagged HTML webpages. The final goal of this project is to utilize this component, together with the other components of the project, to investigate its performance (mainly its successful classification rate) and hopefully to use it in order to improve the experience of finding information on the internet and improve search engines' results.
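To make section 5.1 concrete, the sketch below shows one plausible way to gather two of the Parser's statistics: visible text is extracted with HTMLParser's StringBean, and compressibility follows the gzip ratio defined in section 6.2. This is a minimal illustration, not the project's actual code; the ParserSketch class and its method names are hypothetical, while StringBean is part of the real HTMLParser library.

    import java.io.ByteArrayOutputStream;
    import java.util.zip.GZIPOutputStream;
    import org.htmlparser.beans.StringBean;

    public class ParserSketch {

        // Extract the visible text of a page using HTMLParser's StringBean.
        static String visibleText(String url) {
            StringBean bean = new StringBean();
            bean.setLinks(false);     // exclude link (anchor) text from the extraction
            bean.setURL(url);         // fetch and parse the page at this address
            return bean.getStrings(); // the concatenated visible text
        }

        // Count whitespace-separated words, as used by attributes like wordsInPage.
        static long countWords(String text) {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        // Compressibility as defined in section 6.2:
        // gzipped compressed size divided by original size.
        static double compressibility(String html) throws Exception {
            byte[] original = html.getBytes("UTF-8");
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(buffer);
            gzip.write(original);
            gzip.close(); // flushes the gzip trailer
            return (double) buffer.size() / original.length;
        }
    }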
6 Data Design

This section describes the formats of the files used in the project and a few data structures that were used in it.

6.1 File Formats:
The system uses the following files: the parsed data-set file, the decision tree file, and HTML webpages. The parsed data-set file contains a list of the pages and their attributes. This file is produced by the Parser and is used by the DecisionTree. Below is the format of this file:
<Data set name>
<Attributes' names and types>
<Page's address and its attributes' values>
Example of a parsed data-set file (the "line#:" marks don't appear in the real file and are shown only for clarity):
line1: Dataset1
line2: object name wordsInPage numerical wordsInTitle numerical averageWordLength numerical anchorText numerical fractionOfVisibleContent numerical compressibility numerical fractionDrownFromPopularKeywords numerical fractionOfPopularKeywords numerical spam symbolic
line3: http://2bmail.co.uk 371 5 4.0 0.0 0.0 2.0 0.0 0.0 undecided
line4: http://4road.co.uk 278 18 5.0 0.0 0.0 7.0 0.0 0.0 spam
line5: http://amazon.co.uk 780 11 4.0 0.0 0.0 5.0 0.0 0.0 normal
Implementation issues: we had to modify the code of jaDTi so its file reader would be able to read HTTP addresses (the original code can't handle names with various characters like '.', '/', '-', '_' etc.).
In addition to the parsed data-set file, the system uses a file to store the DecisionTree, which the Trainer is responsible for building. We save the DecisionTree to disk using the ObjectOutputStream class. In order to save the DecisionTree class via ObjectOutputStream we had to modify the code of jaDTi so that all the relevant classes implement the Serializable interface.

6.2 Data structure – PageAttributes class:
The PageAttributes data structure is a representation of an HTML webpage which contains the information needed by the system's components. Each field contains the value of a different attribute of the webpage. The fields are:
Name – String. The HTTP address of the webpage.
WordsInPage – long. The number of words in the webpage's text.
WordsInTitle – long. The number of words in the webpage's title.
AverageWordLength – double. The average number of letters in each word in the text.
AnchorText – long. The number of words in the anchor text.
FractionOfVisibleContent – double. The ratio (amount of words in invisible text / (amount of words in invisible text + amount of words in visible text)). Invisible text is hidden text, which usually appears under HTML tags like META or in ALT attributes.
Compressibility – double. The ratio (gzipped compressed file size / original file size).
fractionDrownFromPopularKeywords – double. The ratio (amount of popular keywords that appear in the text / amount of words in the text).
fractionOfPopularKeywords – double. The ratio (amount of popular keywords that appear in the text / amount of words in the popular keywords' list).
Tag – String. Possible values = {normal, undecided, spam}. In training sets this holds the value of the manually tagged page; it tells the DecisionTree whether the page is spam or a legitimate webpage.
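The fields listed above translate almost directly into Java. The sketch below is a minimal version, assuming the field names of section 6.2; it implements Serializable, and the accompanying helpers illustrate the ObjectOutputStream persistence described in section 6.1 (the project's actual classes may differ):

    import java.io.*;

    public class PageAttributes implements Serializable {
        String name;                              // HTTP address of the webpage
        long wordsInPage;                         // words in the page's text
        long wordsInTitle;                        // words in the title
        double averageWordLength;                 // average letters per word
        long anchorText;                          // words in anchor text
        double fractionOfVisibleContent;          // invisible / (invisible + visible)
        double compressibility;                   // gzipped size / original size
        double fractionDrownFromPopularKeywords;  // popular keywords in text / words in text
        double fractionOfPopularKeywords;         // popular keywords in text / keyword list size
        String tag;                               // "normal", "undecided" or "spam"
    }

    // Save and load helpers of the kind used for the DecisionTree file.
    class Persistence {
        static void save(Serializable object, String filename) throws IOException {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(filename));
            out.writeObject(object);
            out.close();
        }

        static Object load(String filename) throws IOException, ClassNotFoundException {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream(filename));
            Object object = in.readObject();
            in.close();
            return object;
        }
    }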
7 Results

This chapter summarizes the results of the experiments we executed in the project. In our experiments we constructed a few decision trees out of the given data set. We divided the data set into several chunks, each of which contains an equal number of pages. For each of the chunks we built a decision tree and tested the tree on the rest of the chunks. We measured the performance of each tree by calculating the following values:
Match Rate (MR): the number of correct marks the DT made out of the total observed pages.
Spam Precision Rate (SPR): the fraction of spam pages that were actually marked as spam by the DT, out of the number of pages that were marked as spam.
Spam Recall Rate (SRR): the fraction of spam pages that were actually marked as spam by the DT, out of the number of real spam pages.
Non-spam Precision Rate (NPR): the fraction of non-spam pages that were actually marked as non-spam by the DT, out of the number of pages that were marked as non-spam.
Non-spam Recall Rate (NRR): the fraction of non-spam pages that were actually marked as non-spam by the DT, out of the number of pages that are really non-spam.
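These five metrics can be restated in terms of a spam/non-spam confusion matrix: tp = spam pages marked spam, fp = non-spam pages marked spam, fn = spam pages marked non-spam, tn = non-spam pages marked non-spam. The following sketch only illustrates the definitions; it is not the project's evaluation code:

    class Metrics {
        static double matchRate(long tp, long tn, long fp, long fn) {
            return (double) (tp + tn) / (tp + tn + fp + fn); // MR
        }
        static double spamPrecision(long tp, long fp) {
            return (double) tp / (tp + fp);                  // SPR
        }
        static double spamRecall(long tp, long fn) {
            return (double) tp / (tp + fn);                  // SRR
        }
        static double nonSpamPrecision(long tn, long fn) {
            return (double) tn / (tn + fn);                  // NPR
        }
        static double nonSpamRecall(long tn, long fp) {
            return (double) tn / (tn + fp);                  // NRR
        }
    }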
The best results we managed to reach are summarized in the following list:
MR = 92.7%
SPR = 60.6%
SRR = 71.5%
NPR = 96.9%
NRR = 95.3%
These results aren't as good as in the paper but are at least encouraging. In order to investigate what hurt the performance we produced histograms for each attribute.
Experimental setup: the data set we used is taken from http://www.yr-bcn.es/webspam/datasets and contains a large set of .uk pages downloaded in May 2006. A group of volunteers manually classified the whole dataset with normal/undecided/spam tags. The data set contains 8415 pages, of which we found 656 (7.8%) to be dead or not responding. Of the 7759 good pages, 7024 (90.5%) were tagged as normal (not spam), 582 (7.5%) were tagged as spam, and 153 (2%) were tagged as undecided. After filtering out very small pages (with fewer than 20 words in their text), which we consider noise, the data set contained 6875 pages (about 11% of the pages were filtered). The distribution was the following: 6183 (90%) were tagged as normal, 545 (8%) were tagged as spam, and 147 (2%) were tagged as undecided.
We reached our best results by constructing a decision tree from half of the dataset. The decision tree was built using entropy threshold = test score threshold = 0.
[Figures omitted: per-attribute histograms comparing spam and non-spam pages for number of words, number of words in title, average word length, fraction of anchor text, fraction of visible content, compression ratio, fraction of popular keywords, and fraction of text drawn from popular keywords.]

8 Summary

In this project we had the opportunity to explore a few hot topics in the field of search engines. One can learn the importance of this field by observing the impact of search engines like Google and Yahoo on society. We focused on the topic of web spam, whose detection is one of the cornerstones of a good search engine. The increasing size of the advertisement market on the internet leads many people to exploit different flaws in order to optimize their website's rank. Many books have already been written about Search Engine Optimization (SEO), introducing a broad and diverse set of techniques, some of them legal (white hat), some of them less so (grey hat) and some of them illegal (black hat). Search engines' developers are in a constant contest with the developers of SEO techniques. Our project investigated a narrow part of the methods used against SEO.
We started with reading some theoretical background, and then we designed and implemented a classifier that marks web pages as spam or non-spam pages. After the design and implementation phase we constructed a few decision trees using different data sets of manually tagged pages. Using the decision trees we classified the rest of the pages and compared the results to the original tags. We measured the results of the classifier and produced histograms for each attribute. The histograms helped us investigate why the performance wasn't as good as in the paper. We conclude that the size of the data set affected the accuracy of the decision tree, and that there were many small pages that added noise to the process of constructing the decision trees.

9 Future Work

Our suggestions for those who are willing to contribute to this project are:
Observing more attributes: the paper describes a few more page attributes which we didn't observe in our parser. A classifier that analyzes each page using more attributes might reach better performance.
Using an alternative decision tree algorithm: in our project we used the jaDTi package in order to construct a decision tree. Implementing an alternative decision tree constructor might lead to better results.
Using the classifier in a search engine: it's possible to use the classifier in order to filter unwanted pages in a real search engine. The classifier can be combined with other filtering tools which use different techniques to filter spam pages.
10 Definitions, Acronyms, and Abbreviations

HTML – In computing, HyperText Markup Language (HTML) is the predominant markup language for the creation of web pages. It provides a means to describe the structure of text-based information in a document — by denoting certain text as headings, paragraphs, lists, and so on — and to supplement that text with interactive forms, embedded images, and other objects. HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code which can affect the behavior of web browsers and other HTML processors. [From Wikipedia, the free encyclopedia]

Spam – Spamming is the abuse of electronic messaging systems to send unsolicited, undesired bulk messages. While the most widely recognized form of spam is email spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, and mobile phone messaging spam. Spamming is economically viable because advertisers have no operating costs beyond the management of their mailing lists, and it is difficult to hold senders accountable for their mass mailings. Because the barrier to entry is so low, spammers are numerous, and the volume of unsolicited mail has become very high. The costs, such as lost productivity and fraud, are borne by the public and by Internet service providers, which have been forced to add extra capacity to cope with the deluge. Spamming is widely reviled, and has been the subject of legislation in many jurisdictions. [From Wikipedia, the free encyclopedia]

Web Spam, Spamdexing – Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the META keywords tag, others whether the search term appears in the body text or URL of a web page. A variety of techniques are used to spamdex. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful, and the success of Google at both producing better search results and combating keyword spamming, through its reputation-based PageRank link analysis system, helped it become the dominant search site late in the decade, where it remains. While it has not been rendered useless by spamdexing, Google has not been immune to more sophisticated methods either. Google bombing is another form of search engine result manipulation, which involves placing hyperlinks that directly affect the rank of other sites. Common spamdexing techniques can be classified into two broad classes: content spam and link spam. [From Wikipedia, the free encyclopedia]

Decision Tree – A decision tree is a predictive model; that is, a mapping of observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification tree or reduction tree. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications. The machine learning technique for inducing a decision tree from data is called decision tree learning, or (colloquially) decision trees. [From Wikipedia, the free encyclopedia]

Anchor Text – This text is supposed to give additional information about the links in the webpage. Some search engines use this text in order to rank the relevance of a webpage to different keywords. Therefore some web spammers use it as a place to implant popular keywords in order to mislead the search engine.

Popular keywords – We use this term to describe a list of N keywords that are very popular. A keyword is considered popular if it appears in many searches performed by end users; the more searches it appears in, the more popular the keyword is. Web spammers tend to implant popular keywords in their webpages, hoping this will make the search engine rank the page higher. Therefore the appearance of many popular keywords in a webpage could hint that it is a spam webpage. Our list contains 1000 popular keywords, and we shall run experiments with different values of N to see the impact of the size of this list on the classifier's performance.
11 User Interface

This section describes how to run every module in the project.

DestinedParser:
Command line format: DestinedParser <list_of_pages> <db_name> <train/classify>
Command line examples:
DestinedParser full_list.txt full_list train
DestinedParser full_list.txt full_list classify
The first argument is a file that contains a list of the to-be-parsed pages. The second argument is the name of the output db. The last argument tells the parser whether the output should be in train format or in classify format. In train format the output file will contain the tag for each webpage at the end of its line; in classify format the output file won't contain tags. The output file will be named <db_name>.db

Trainer:
Command line format: Trainer <db_filename>
Command line example: Trainer full_list.db
The only argument is the name of the database from which the decision tree will be constructed. The resulting decision tree will be saved in the file <db_filename>.dt

Classifier:
Command line format: Classifier <dataset_file> <decision_tree_name> <output_file>
Command line example: Classifier list1.db list1.db tree1_list1.txt
The first argument is a file that contains the parsed webpages that we want to classify. The second argument is the name of the decision tree; the Classifier assumes there's a file named <decision_tree_name>.dt which contains the decision tree. The last argument is the name of the file that will contain the output. The output will contain the same line for each of the webpages, with the decided tag at the end of its line.

12 Reference Material

12.1 The free data set of manually tagged HTML pages we used to train our classifier: http://www.yr-bcn.es/webspam/datasets/webspam-uk2006-1.1.tar.gz
12.2 HTMLParser's homepage: http://htmlparser.sourceforge.net/
12.3 Description of the C4.5 algorithm: http://en.wikipedia.org/wiki/C4.5_algorithm
12.4 jaDTi's homepage: http://www.run.montefiore.ulg.ac.be/~francois/software/jaDTi/