
Machine Learning and Understanding the Web

By Mark Chavira and Ulises Robles

February 14, 2000

Introduction

The World Wide Web contains information on many subjects, and on many subjects it contains a great deal of information. As it has grown, this store of data has developed into a mass of digitized knowledge that is unprecedented in both breadth and depth. Although researchers, hobbyists, and others have already discovered sundry uses for the resource, the sheer size of the WWW limits its use in many ways. To help manage the complexities of size, users have enlisted the aid of computers in ways that go beyond the simplistic ability to access pages by typing in URLs or by following the hyperlink structure. For example, Internet search engines allow users to find information in a way that is more convenient than, and not always explicit in, the hyperlinks. Although computers already help us manage the Web, we would like them to do more. We would like to be able to ask a computer general questions, questions to which answers exist on the Web, questions like, "Who is the chair of the Computer Science Department at University X?" For computers to give such assistance, however, they must be able to understand a large portion of the semantic content of the Web, and they do not currently understand this content. There is good reason: the Web was designed for human understanding, not for computerized understanding. As a result, one could argue that getting a computer to understand the content of the Web is akin to getting computers to understand natural language, a goal that remains elusive. Two broad opinions have emerged about how to solve this problem. This paper gives an overview of both and then presents three examples of ways in which researchers are exploring the second possible solution.

The Structuring Solution

The first solution, sometimes championed by database experts and those accustomed to working with other types of highly structured data, is to change the Web. An extreme version of this view holds that the data in the WWW is completely unstructured and that, so long as it remains so, computers will have no hope of providing assistance analogous to answering SQL queries. Proponents of this view would like to impose structure on the information in the Web in a way that is analogous to the way information is structured within a relational database. At the very least, they would like a new form of markup language to replace HTML, one that would give the computer clues to help it infer the semantics of the text. In the extreme, proponents of this "structuring solution" would like pages on the Web to be like forms that authors fill in, forms that computers know about and can interpret. To be universally applicable to the WWW, both the moderate and the extreme approaches require that existing data be put into a different form and that new data published to the Web conform to the new structure.

The Learning Solution

The second solution, sometimes championed by Artificial Intelligence experts and those accustomed to working with uncertainty, is to make computers smarter. Proponents of this view argue against the structuring solution, saying that, because humans would need to do the structuring, the task is simply not feasible. Dr. Christopher Manning, a professor in Linguistics and Computer Science at Stanford University, believes that there is simply too much information to convert; that the vast majority of people who publish pages to the Web would never agree to conform; and that the structure imposed would, to some extent, limit flexibility and expressiveness. Dr. Manning believes that the content on the Web is not unstructured. Rather, it possesses a "difficult structure," a structure that is possibly closer to the structure of natural language than to that of the "easy structure" found within a database [Manning, 2000]. Those who champion the "learning solution" believe that computerized understanding, at least some form of simple comprehension, is an achievable goal. Moreover, some believe that the problem of understanding the Web is a simpler one than the problem of understanding natural language, because the Web does impose more structure than, say, a telephone conversation. There are, after all, links, tags, URLs, and standard styles of Web page design that can give the computer hints about semantic intent. Even looking at the text in isolation, without making use of additional information within the page, has the potential to give good results. As a result, attempts at learning from text have direct applicability to learning from the Web. The remainder of this paper explores a small sampling of the work of researchers who champion the learning solution. The researchers in these examples work primarily on learning from text, without special consideration for other information that is embedded within Web pages.

Obviously, the task is big. Initial work in the area has focused on solving similar but much smaller problems, with the hope that solutions to these smaller problems will lead to something more.

Naive Bayes

Naive Bayes is the default learning algorithm used to classify textual documents. In a paper entitled "Learning to Extract Symbolic Knowledge from the World Wide Web" [Craven et al., 1998], researchers at Carnegie Mellon University describe some of their experience applying Naive Bayes to the Web. The researchers describe their goal as follows:

[The research effort has] the long-term goal of automatically creating and maintaining a computer-understandable knowledge base whose content mirrors that of the World Wide Web... Such a "World Wide Knowledge Base" would consist of computer-understandable assertions in symbolic, probabilistic form, and it would have many uses. At a minimum, it would allow much more effective information retrieval by supporting more sophisticated queries than current keyword-based search engines. Going a step further, it would enable new uses of the Web to support knowledge-based inference and problem solving.

The essential idea is to design a system which, when given (1) an "ontology specifying the classes and relations of interest" and (2) "training examples that represent instances of the ontology classes and relations," would then learn "general procedures for extracting new instances of these classes and relations from the Web." In their preliminary work, the researchers simplified the problem by focusing on a small subset of the Web and by attempting to train the computer to recognize a very limited set of concepts and relationships. More specifically, the team acquired approximately ten thousand pages from the sites of various Computer Science Departments. From these pages, they attempted to train the computer to recognize the kinds of objects depicted in Figure 1.

Figure 1. CMU Entity Hierarchy

    Entity
        Other
        Activity
            Research Project
            Course
        Person
            Faculty
            Staff
            Student
        Department

In addition, they attempted to have the system learn the following relationships among the objects:

(InstructorOfCourse A B)

(MembersOfProject A B)

(DepartmentOfPerson A B)

The above relationships can be read, "A is the instructor of course B;" "A is a member of project B;" and "A is the department of person B." The two main goals of the research were (1) to train the computer to accurately assign one of the eight leaf classes to a given Web page from a CS department and (2) to train the computer to recognize the above relationships among the pages that represent the entities. The team used a combination of methods. We will consider goal (1), since it is to this goal that the group applied Naive Bayes.


Underlying the work was a set of assumptions that further simplified the task. For example, the team assumed that each instance of an entity corresponds to exactly one page in their sample. As a consequence of this assumption, a student, for example, corresponds to the student's home page. If the student has a collection of pages at his or her site, then the main page would be matched with the student and the other pages would be categorized as "other." These simplifying assumptions are certainly a problem; however, as the project progresses, the team intends to remove many of them to make the work more general.

During the first phase of the experiment, the researchers hand-classified the pages. Afterward, they applied a Naive Bayes learner to a large subset of the ten thousand pages in order to generate a system that could identify the class of a page it had not yet seen and for which it did not already have the answer. The team then applied the system to the remaining pages to test the coverage and accuracy of the results. They used four-fold cross validation to check their results.

The page classification sub-problem demonstrates the kinds of results the group achieved. To classify a page, the researchers used a classifier that assigned a class c' to a document d according to the following equation:

c' = \arg\max_c \left[ \log \Pr(c) + n \sum_{i=1}^{T} \Pr(w_i \mid d) \, \log \frac{\Pr(w_i \mid c)}{\Pr(w_i \mid d)} \right]

Equation 1

The paper describes the terms in the equation as follows: "where n is the number of words in d, T is the size of the vocabulary, and w_i is the i-th word in the vocabulary. Pr(w_i | c) thus represents the probability of drawing w_i given a document from class c, and Pr(w_i | d) represents the frequency of occurrences of w_i in document d." The approach is familiar. Define a discriminant function for each class. For a given document d, run the document through each of the discriminant functions and choose the class that corresponds to the largest result. Each discriminant function makes use of Bayes' law and, in this experiment, assumes feature independence (hence the "naive" part of the name).
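As a rough illustration of a classifier of this form, the following Python sketch applies Equation 1 to tokenized documents. It is not the CMU system; the tiny training set, the class names, and the use of Laplace smoothing are assumptions added only to make the sketch runnable.

# Minimal sketch of a Naive Bayes page classifier in the style of Equation 1.
# Training data, class names, and smoothing are illustrative assumptions.
import math
from collections import Counter

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate Pr(c) and Pr(w|c) from tokenized documents, with Laplace smoothing."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)
    cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        cond[c] = {w: (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
                   for w in vocab}
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Pick argmax_c of log Pr(c) + n * sum_i Pr(w_i|d) * log(Pr(w_i|c) / Pr(w_i|d))."""
    counts = Counter(w for w in doc if w in vocab)
    n = sum(counts.values())
    if n == 0:                                   # no known words: fall back to the prior
        return max(prior, key=prior.get)
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for w, k in counts.items():
            p_wd = k / n                         # Pr(w_i | d)
            score += n * p_wd * math.log(cond[c][w] / p_wd)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny hypothetical usage:
docs = [["course", "syllabus", "homework"], ["research", "advisor", "thesis"]]
labels = ["course", "student"]
prior, cond, vocab = train_naive_bayes(docs, labels)
print(classify(["homework", "syllabus"], prior, cond, vocab))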

Because of this independence assumption, the method used at CMU does not suffer terribly from the curse of dimensionality, which would otherwise become severe, since each word in the vocabulary adds an additional dimension. With coverage of approximately 20%, the average accuracy for each class was approximately 60%. With higher coverage, accuracy goes down, so that at 60% coverage, accuracy is roughly 40%. These numbers are not especially impressive, but the simplifying assumptions contributed to the low performance. For example, for a collection of pages that comprise a student's Web site, the classifier might assign many of them to the student class. Recall that, because of an artificial simplifying assumption, only one page corresponds to the student, while the others belong to the "other" class. As the researchers remove assumptions and adjust their learning algorithms accordingly, performance should improve. As a side note, when the researchers introduced additional heuristics, such as heuristics that examine patterns in the URL, accuracy improved to 90% at 20% coverage and 70% at 60% coverage. We shall see how some other techniques perform better than Naive Bayes used in isolation.

Maximum Entropy

Over the years, other techniques for supervised learning in text classification have emerged. Nigam, Lafferty, and McCallum [1999] describe one of them: Maximum Entropy, which has been applied to a variety of natural language tasks. Maximum Entropy estimates class probability distributions from a given set of labeled training data. The methodology the authors present is an iterative scaling algorithm that finds the maximum-entropy distribution consistent with constraints derived from the features of the labeled classes. The algorithm defines a model of the class distributions. The model begins as a uniform distribution, since nothing is yet known about the distributions of the classes. The algorithm then changes the model in an iterative fashion. With each iteration, the algorithm uses the labeled training data to constrain the model so that it matches the data more closely. After the algorithm concludes, the model gives a good estimate of the distribution of class labels given a document. The authors' experiments show that Maximum Entropy outperforms Naive Bayes but that it sometimes suffers from over-fitting the training data due to poor feature selection. Using priors together with Maximum Entropy improves performance.


The researchers first selected a set of features. The features used in this experiment were word counts. For each (class, word) pair, the algorithm computes a count that is the expected number of times the word appears in a document of the class, divided by the total number of words in the document. If a word occurs frequently in a given class, the corresponding weight for that class-word pair is set higher than for class-word pairs having a different class and the same word. The authors point out that this method is typical in natural language classification.
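As a concrete rendering of this feature definition (the function name and the normalization by document length are illustrative assumptions, not code from the paper), a word-count feature for a (class, word) pair can be written as follows:

# Illustrative word-count feature f for the pair (target_class, word): the
# normalized count of the word in the document when the candidate class
# matches, and 0 otherwise.
def word_count_feature(word, target_class, doc_tokens, candidate_class):
    if candidate_class != target_class:
        return 0.0
    return doc_tokens.count(word) / len(doc_tokens)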

Maximum Entropy starts by restricting the model distribution so that the expectation of each feature under the model matches that feature's average over the labeled training documents, as expressed by the following equation:

\frac{1}{|D|} \sum_{d \in D} f_i(d, c(d)) \;=\; \frac{1}{|D|} \sum_{d \in D} \sum_{c} P(c \mid d)\, f_i(d, c)

Equation 2

where each f_i(d, c) is a feature of document d for class c. From Equation 2, the paper concludes that the parametric form of the conditional distribution of each class has the exponential form:

P(c \mid d) \;=\; \frac{1}{Z(d)} \exp\!\Big( \sum_i \lambda_i f_i(d, c) \Big)

Equation 3

where \lambda_i is the parameter to be learned for feature i and Z(d) is a normalizing constant.
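As a small, purely illustrative sketch of Equation 3 (the data structures and function name are assumptions), the conditional distribution can be computed by exponentiating the weighted feature sums and normalizing by Z(d):

# Sketch of Equation 3: P(c | d) = exp(sum_i lambda_i f_i(d, c)) / Z(d).
# feature_values[c] is the list of f_i(d, c) values for class c; lambdas[i]
# is the weight for feature i. Both are hypothetical inputs.
import math

def class_posterior(feature_values, lambdas):
    scores = {c: math.exp(sum(l * f for l, f in zip(lambdas, feats)))
              for c, feats in feature_values.items()}
    z = sum(scores.values())                 # Z(d), the normalizing constant
    return {c: s / z for c, s in scores.items()}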

The authors then present the IIS (Improved Iterative Scaling) procedure, a hill-climbing algorithm used to compute the parameters of the classifier given the labeled data. The algorithm works in log-likelihood space. Given the training data, we can compute the log-likelihood of the model as follows:

\ell(\Lambda \mid D) \;=\; \log \prod_{d \in D} P_\Lambda(c(d) \mid d) \;=\; \sum_{d \in D} \sum_i \lambda_i f_i(d, c(d)) \;-\; \sum_{d \in D} \log \sum_c \exp\!\Big( \sum_i \lambda_i f_i(d, c) \Big)

Equation 4

where P_\Lambda denotes the model, a collection of conditional distributions of the form in Equation 3 (one for each class) parameterized by \Lambda, and D is the set of documents. A general outline of the Improved Iterative Scaling algorithm follows. Given the set of labeled documents D and a set of features f, perform the following steps:

1. For all features f_i, compute the expected value over the training data according to Equation 2.

2. Initialize all the feature parameters to be estimated (the \lambda_i) to 0.

3. Iterate until convergence occurs, i.e., until we reach the global maximum:

   a. Calculate the expected class labels for each document with the current parameters, P_\Lambda(c \mid d); i.e., solve Equation 3.

   b. For each \lambda_i:

      i. Using standard hill climbing, find a step size \delta_i that increases the log-likelihood.

      ii. Set \lambda_i = \lambda_i + \delta_i.

4. Output: the text classifier.

The analysis presented in the paper shows that, at each step, we can find changes to each \lambda_i that move the model toward convergence at the single global maximum of the likelihood "surface". The paper states that there are no local maxima.
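To make the procedure concrete, the sketch below trains a small maximum entropy classifier of the form in Equation 3. For brevity it uses plain gradient ascent on the log-likelihood in place of the IIS step-size search, which is a substitution, not the authors' algorithm; the toy documents, learning rate, and iteration count are likewise assumptions.

# Self-contained sketch of maximum entropy text classification. Gradient
# ascent on the log-likelihood of Equation 4 replaces IIS purely for brevity.
import math
from collections import Counter

docs = [["course", "syllabus", "exam"], ["thesis", "advisor", "research"],
        ["exam", "homework", "course"], ["research", "paper", "advisor"]]
labels = ["course", "student", "course", "student"]
classes = sorted(set(labels))

def features(doc, c):
    """Word-count features, one per (class, word) pair, normalized by length."""
    counts = Counter(doc)
    return {(c, w): counts[w] / len(doc) for w in counts}

def posterior(doc, lam):
    """Equation 3: P(c | d) proportional to exp(sum_i lambda_i f_i(d, c))."""
    scores = {c: math.exp(sum(lam.get(k, 0.0) * v
                              for k, v in features(doc, c).items()))
              for c in classes}
    z = sum(scores.values())                 # Z(d)
    return {c: s / z for c, s in scores.items()}

lam = {}                                     # all parameters start at 0 (step 2)
for _ in range(100):                         # iterate toward the maximum (step 3)
    grad = Counter()
    for doc, label in zip(docs, labels):
        for k, v in features(doc, label).items():
            grad[k] += v                     # observed feature value
        post = posterior(doc, lam)
        for c in classes:
            for k, v in features(doc, c).items():
                grad[k] -= post[c] * v       # model's expected feature value
    for k, g in grad.items():
        lam[k] = lam.get(k, 0.0) + 0.1 * g   # gradient ascent update

print(posterior(["homework", "exam"], lam))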


Compared to Naive Bayes techniques, Maximum Entropy does not require that the features be independent. For instance, the phrase "Palo Alto" consists of two words that almost always occur together and rarely occur by themselves. Naive Bayes considers the two words independently and effectively counts the evidence twice. Maximum Entropy, however, reduces the weight of these features by half, since the constraints are based on the expected counts.

The authors use three data sets for evaluating the performance of the algorithm:

The WebKB data set [Craven et al., 1998] contains Web documents from university computer science departments. In the present research, they use those pages in this data set that correspond to student, faculty, course, and project pages (4199 pages).

The Industry Sector Hierarchy data set [McCallum and Nigam, 1998] contains company Web pages classified into a hierarchy of industry sectors. The 6,440 pages are divided into 71 classes, and the hierarchy is two levels deep.

The Newsgroup data set [Joachims, 1997] contains about 20,000 articles divided into 20 UseNet discussion groups. The project removed words that occur only once.

The authors also considered Naive Bayes as a textual classification algorithm. They present a comparison using two variants of Naive Bayes: scaled (the word counts are scaled so that each document has a constant number of word occurrences) and un-scaled. They mention that in most cases the scaled version performs better than regular Naive Bayes.

The experimenters used cross validation to test results. For the Newsgroup and the Industry Sector data sets, the algorithm sometimes over-fit. To prevent this, the researchers stopped the iterations early. In all the test cases, Maximum Entropy performed better than regular Naive Bayes, especially on the WebKB data set, where the algorithm achieved a 40% reduction in error over Naive Bayes. Compared to scaled Naive Bayes, however, the Maximum Entropy results were sometimes better and sometimes slightly worse. The authors attribute the worse performances to over-fitting.

To help deal with over-fitting, the authors experimented with Gaussian priors with mean zero and a diagonal covariance matrix, using the same prior variance for every feature.

Equation 5 describes the Gaussian distribution of the priors.

P(\Lambda) \;=\; \prod_i \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left( -\frac{\lambda_i^2}{2\sigma_i^2} \right)

Equation 5

where \lambda_i is the parameter for the i-th feature and \sigma_i^2 is its variance.

The paper also shows that using a Gaussian prior with Maximum Entropy reduces over-fitting and lowers classification error. As a consequence, the performance is better than that obtained with scaled Naive Bayes. In the cases where no over-fitting was encountered without priors, the performance is almost unchanged.
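Stated as a formula (a standard restatement of maximum a posteriori estimation, not an equation reproduced from the paper), adding the prior means the quantity being maximized becomes the log-likelihood of Equation 4 plus the logarithm of the prior in Equation 5:

\ell'(\Lambda \mid D) \;=\; \ell(\Lambda \mid D) \;-\; \sum_i \frac{\lambda_i^2}{2\sigma^2} \;+\; \text{constant}

Each parameter is thus pulled toward zero, which is what discourages over-fitting.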

The authors point out a few shortcomings of their approach. One shortcoming is that they used the same features for every class. This need not be the case, and the learner would be more flexible if it did not use features in this way. Another limitation is that the researchers used the same Gaussian prior variance in all the experiments. This choice is not ideal, particularly when the training data is sparse; a possible improvement is to adjust the prior based on the amount of training data. The authors hypothesize that another improvement would result from using feature functions of the form log(count), or some other sub-linear transformation, instead of the counts themselves. They have observed that using un-scaled counts decreases accuracy.


Classifying with Unlabeled Training Data (Bootstrapping and EM)

Many machine learning tasks begin with (1) a set of classifications c_1, ..., c_n and (2) a set of instances that need to be classified according to (1). The first step is to hand-label a number of training instances for input into the learning algorithm and test instances for input into the resulting classifier. In many cases, to obtain good results, the number of hand-labeled instances must be large. Herein lies a problem. Hand-labeling is usually difficult and time-consuming, and it would be desirable to skip this step. This problem also arises when turning to the Web and to classifying text in general. Hence the motivation for the research described in "Text Classification by Bootstrapping with Keywords, EM and Shrinkage" [McCallum and Nigam, 1999]. In this paper, the researchers describe an approach to classifying text that does not require hand-labeled training instances. They begin with the goal of classifying Computer Science papers into 70 different topics (e.g., NLP, Interface Design, Multimedia), which are sub-disciplines of Computer Science. The researchers proceed as follows:

1. For each class c_i, define a short list of keywords.

2. Classify the documents according to the lists of keywords. Some documents will remain unclassified.

3. Construct a classifier using the instances that have been classified.

4. Use the classifier to classify all instances.

5. Iterate over steps (3) and (4) until convergence occurs.

Step (1) is to choose keywords for each class. A person chooses these keywords in a way he or she believes will help identify instances of the class. This selection process is not easy; it requires much thought, trial, and error. However, it requires much less work than hand-labeling data. Some of the keywords the researchers chose follow:

Topic               Keywords
NLP                 language, natural, processing, information, text
Interface Design    interface, design, user, sketch, interfaces
Multimedia          real, time, data, media

Step (2) classifies documents according to the keywords. To perform this step, the computer essentially searches for the first keyword that appears in the given document and assigns the document to the corresponding class. Doing so leaves many documents unclassified and some documents misclassified. In step (3), the researchers use a Naive Bayes learner to construct an initial classifier from the documents classified in step (2), the labels obtained from step (2), and discriminant equations derived from Equation 6.

P(c_j \mid d_i) \;\propto\; P(c_j)\, P(d_i \mid c_j) \;=\; P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)

Equation 6

where c_j is the class, d_i is the document being considered, and w_{d_i,k} is the k-th word in document d_i. Finally, the researchers use the results obtained thus far to "bootstrap" the EM algorithm. That is, they use the results as a starting point for the EM learner, which successively improves the quality of the classifier.
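As a rough, purely illustrative sketch of the keyword-labeling step (the keyword lists are abbreviated from the table above, and the matching rule is simplified to "first topic with any matching keyword"), the initial labels might be produced as follows:

# Illustrative sketch of bootstrapping steps (1) and (2): assign each document
# the class of the first keyword list it matches, leaving it unlabeled otherwise.
keywords = {
    "NLP": ["language", "natural", "processing", "text"],
    "Interface Design": ["interface", "design", "user", "sketch"],
    "Multimedia": ["real", "time", "data", "media"],
}

def keyword_label(doc_tokens):
    for topic, words in keywords.items():
        if any(w in doc_tokens for w in words):
            return topic
    return None                      # unclassified; left for the learner to fill in

docs = [["natural", "language", "parsing"], ["video", "codec", "media"],
        ["theorem", "proving"]]
initial_labels = [keyword_label(d) for d in docs]
print(initial_labels)                # documents labeled None await the bootstrap classifier

Documents labeled None here are exactly the ones that the bootstrapped Naive Bayes classifier and EM subsequently label.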

The paper describes EM as follows:

EM is a class of iterative algorithms for maximum likelihood or maximum a posteriori parameter estimation in problems with incomplete data [Dempster et al., 1977]. Given a model of data generation and data with some missing values, EM iteratively uses the current model to estimate the missing values, and then uses the missing value estimates to improve the model. Using all the available data, EM will locally maximize the likelihood of the parameters and give estimates for the missing values. In our scenario, the class labels of the unlabeled data are the missing values.

EM essentially consists of two phases. The "E" phase calculates probabilistically-weighted class labels, P(c_j | d_i), for every document using the current classifier and Equation 6. The "M" phase constructs a new classifier according to Naive Bayes from all of the classified instances. EM then iterates over the E and M phases until convergence occurs. In addition to EM, the authors also applied a technique known as shrinkage to assist with the sparseness of the data.
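The sketch below shows the shape of this E/M loop with a small multinomial Naive Bayes model in plain Python. It is an illustration only: the documents, the smoothing constant, and the choice to start unlabeled documents with uniform class weights are assumptions, and shrinkage is omitted entirely.

# Illustrative E/M loop for semi-supervised Naive Bayes (shrinkage omitted).
# Documents whose label is None are the "missing values" that EM estimates.
import math
from collections import Counter

docs = [["language", "text", "parsing"], ["media", "video", "stream"],
        ["grammar", "parsing", "corpus"], ["stream", "codec", "video"]]
labels = ["NLP", "Multimedia", None, None]        # partial labels from keywords
classes = ["NLP", "Multimedia"]
vocab = sorted({w for d in docs for w in d})

def m_step(docs, weights, alpha=1.0):
    """Re-estimate P(c) and P(w|c) from probabilistically weighted documents."""
    prior, cond = {}, {}
    for c in classes:
        wsum = sum(weights[i][c] for i in range(len(docs)))
        prior[c] = wsum / len(docs)
        counts = Counter()
        for i, d in enumerate(docs):
            for w in d:
                counts[w] += weights[i][c]
        total = sum(counts.values())
        cond[c] = {w: (counts[w] + alpha) / (total + alpha * len(vocab))
                   for w in vocab}
    return prior, cond

def e_step(docs, prior, cond):
    """Compute P(c | d) for every document with the current classifier (Equation 6)."""
    weights = []
    for d in docs:
        logp = {c: math.log(prior[c]) + sum(math.log(cond[c][w]) for w in d)
                for c in classes}
        m = max(logp.values())
        probs = {c: math.exp(v - m) for c, v in logp.items()}
        z = sum(probs.values())
        weights.append({c: p / z for c, p in probs.items()})
    return weights

# Bootstrap: keyword-labeled documents get weight 1 on their class; the rest
# start with uniform weights (a simplification of the paper's procedure).
weights = [{c: (1.0 if label == c else 0.0) if label else 1.0 / len(classes)
            for c in classes} for label in labels]
for _ in range(10):
    prior, cond = m_step(docs, weights)       # "M" phase
    weights = e_step(docs, prior, cond)       # "E" phase
print([max(w, key=w.get) for w in weights])   # hard labels after convergence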

The researchers found that keywords alone provided 45% accuracy. The classifier that they constructed using bootstrapping, EM, and shrinkage obtained 66% accuracy. The paper notes that this level of accuracy approaches the estimated human agreement level of 72%. Exactly what human agreement means, the paper leaves unclear. The paper does not discuss coverage.

Conclusion

Computers should be able to help us obtain information from the Web in ways that are more sophisticated than those used by current search engines. Different approaches to achieving this goal have emerged. The structuring approach seeks to change the Web so that the Web is easier for computers to understand. The learning approach seeks to make computers smarter, so that they can understand the Web as it is. Because learning from the Web is similar to learning from text, textual approaches serve as one of the foundations for work within the learning approach. The default learning algorithm for producing text document classifiers is Naive Bayes. When applied to Web page classification, Naive Bayes demonstrates results that are similar to those achieved when it is applied to text document classification. To improve on Naive Bayes, researchers have explored other learners, including Maximum Entropy and EM, which replace and/or augment Naive Bayes. In some cases, these other learners outperform Naive Bayes. Work in training computers to understand Web content, and to use that understanding to provide solutions, is still in its preliminary stages, but the research has given some promising results.

References

[Manning, 2000] C. Manning. January 2000. Lecture before the Digital Library Group at Stanford University.

[Craven et al., 1998] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery. 1998. Learning to Extract Symbolic Knowledge from the World Wide Web.

[McCallum and Nigam, 1999] A. McCallum, K. Nigam. 1999. Text Classification by Bootstrapping with Keywords, EM and Shrinkage.

[Nigam et al., 1999] K. Nigam, J. Lafferty, A. McCallum. 1999. Using Maximum Entropy for Text Classification.

[McCallum and Nigam, 1998] A. McCallum, K. Nigam. 1998. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI-98 Workshop on Learning for Text Categorization. Tech. rep. WS-98-05, AAAI Press. http://www.cs.cmu.edu/~mccallum.

[Joachims, 1997] T. Joachims. 1997. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Machine Learning: Proceedings of the Fourteenth International Conference (ICML '97), pages 143-151.

[Dempster et al., 1977] A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.
