Web Mining


Lecture 9: The Future of Web Mining

(Chap 9, Chakrabarti)

Wen-Hsiang Lu (盧文祥)

Department of Computer Science and Information Engineering,

National Cheng Kung University

2004/12/23

Important Issues

• Information Extraction

• Natural Language Processing

• Question answering

• Profiles, Personalization, and Collaboration

Information Extraction

• An HR firm may wish to monitor the Web sites of businesses in a specific sector for available job positions with salaries and locations, and build and maintain a structured database containing this data to help design their pay packages.

• A market analyst may wish to monitor management changes in companies from a specified sector and get updates of the form “X replaced Y in position P of company C.”

• A researcher may wish to monitor a set of university and journal Web sites for articles that claim to improve on a specific technique and to be notified with the title, authors, and a URL where the article is available online.

• An academic department may wish to monitor other universities for promising doctoral candidates to hire in specified areas, with related faculty being notified about significant publications by the candidates.
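The market-analyst scenario above can be illustrated with a pattern-based extractor. This is only a minimal sketch: the regular expression, the field names, and the sample sentence are assumptions made for illustration, and real Web text would need far more robust handling (name recognition, varied phrasings, learned extractors).

```python
import re

# Hypothetical pattern for sentences shaped like "X replaced Y as P of C".
# Names are assumed to be runs of capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
PATTERN = re.compile(
    rf"(?P<new>{NAME}) replaced (?P<old>{NAME}) as "
    rf"(?P<position>[A-Za-z ]+?) of (?P<company>{NAME})"
)

def extract_changes(text):
    """Return one record (dict) per management change found in the text."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

sample = "Jane Doe replaced John Smith as chief executive of Acme Widgets."
records = extract_changes(sample)
print(records)
```

Each record is a row that could be stored in the structured database the slide describes.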

Lexical Network and Ontologies

• WordNet: an English dictionary in which unique concepts are represented by nodes called synsets (synonym sets)

– bronco: bronco, mustang, pony, horse, equine, odd-toed ungulate, placental mammal, mammal, vertebrate, chordate, animal, organism

• The antonym (opposite-of) relation holds between words, not between synsets. For example:

– wet: watery, damp, moist, humid, soggy

– dry: parched, arid, anhydrous, sere

– Only dry and wet are antonyms
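The distinction between synset-level and word-level relations can be sketched with toy data structures. The word groups and the chain below are copied from the slide; they are not loaded from the real WordNet database, and the function names are hypothetical.

```python
# Toy data from the slide, not the actual WordNet database.
HYPERNYM_CHAIN = [
    "bronco", "mustang", "pony", "horse", "equine", "odd-toed ungulate",
    "placental mammal", "mammal", "vertebrate", "chordate", "animal", "organism",
]
WORD_GROUPS = {
    "wet": {"wet", "watery", "damp", "moist", "humid", "soggy"},
    "dry": {"dry", "parched", "arid", "anhydrous", "sere"},
}
# Antonymy is stored between individual words, not between groups.
ANTONYM_PAIRS = {frozenset({"wet", "dry"})}

def is_a(word, ancestor):
    """True if `ancestor` appears at or after `word` in the chain."""
    if word not in HYPERNYM_CHAIN or ancestor not in HYPERNYM_CHAIN:
        return False
    return HYPERNYM_CHAIN.index(word) <= HYPERNYM_CHAIN.index(ancestor)

def same_group(a, b):
    """True if both words belong to the same word group."""
    return any(a in g and b in g for g in WORD_GROUPS.values())

def are_antonyms(a, b):
    """True only for word pairs explicitly recorded as antonyms."""
    return frozenset({a, b}) in ANTONYM_PAIRS

print(is_a("bronco", "mammal"), same_group("damp", "moist"))
print(are_antonyms("wet", "dry"), are_antonyms("damp", "arid"))
```

Note that "damp" and "arid" sit in opposing groups but are not themselves antonyms, which is the point the slide makes.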

Lexical Network and Ontologies

• An ontology is a kind of schema describing specific roles of entities and relations between entities

• For example

– A PC troubleshooting site may use a custom ontology covering hard disks, PCI buses, CPUs, CPU fans, SCSI cables, jumper settings, device drivers, CD-ROMs, software, installation, etc.

– A university department ontology may comprise entities such as faculty, student, administrative staff, research project, sponsor organization, research paper, journal, conference, and the like, together with relations among them

• A great deal of manual labor is needed to build lexical networks and ontologies
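An ontology in the sense used here can be thought of as a small schema of entity types plus typed relations. The sketch below uses entity types from the university example; the relation names and the checking function are hypothetical and only show the idea.

```python
# Entity types from the university example on the slide.
ENTITY_TYPES = {
    "faculty", "student", "administrative staff", "research project",
    "sponsor organization", "research paper", "journal", "conference",
}

# Hypothetical relations: name -> (subject type, object type).
RELATIONS = {
    "advises": ("faculty", "student"),
    "authored": ("faculty", "research paper"),
    "published_in": ("research paper", "journal"),
    "funds": ("sponsor organization", "research project"),
}

def is_valid_fact(relation, subject_type, object_type):
    """Check that a fact respects the schema's relation signature."""
    return RELATIONS.get(relation) == (subject_type, object_type)

print(is_valid_fact("advises", "faculty", "student"))
print(is_valid_fact("advises", "student", "journal"))
```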

Part-of-Speech and Sense Tagging

• The extent of ambiguity in common words

– Run: 11 noun senses, 42 verb senses

• Delimiting regions of sentences with part-of-speech (POS) tags

• A manually designed tag set and a collection of hand-tagged documents are needed for training a supervised tagger.

Word     Possible POS
The      article
man      noun, verb
still    noun, verb, adjective, adverb
saw      noun, past-tense verb
her      object pronoun, possessive pronoun

Part-of-Speech and Sense Tagging

• Approaches to IE and POS tagging are very similar

• HMMs can be used for POS tagging

• Over 130 POS tags are used regularly: http://www.comp.lancs.ac.uk/ucrel/claws1tags.html
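HMM-based tagging can be sketched with Viterbi decoding on the example sentence "The man still saw her". All probabilities below are hypothetical, hand-picked values for illustration (a real tagger would estimate them from hand-tagged documents), and the six-tag set is a simplification of the possible POS listed for each word.

```python
import math

TAGS = ["ART", "NOUN", "VERB", "ADJ", "ADV", "PRON"]
FLOOR = 1e-12  # stand-in probability for events not listed below

# Hypothetical model parameters, for illustration only.
START = {"ART": 0.6, "NOUN": 0.2, "PRON": 0.2}
TRANS = {
    ("ART", "NOUN"): 0.8, ("ART", "ADJ"): 0.2,
    ("NOUN", "VERB"): 0.5, ("NOUN", "ADV"): 0.3, ("NOUN", "NOUN"): 0.2,
    ("ADV", "VERB"): 0.7, ("ADV", "ADJ"): 0.3,
    ("VERB", "PRON"): 0.4, ("VERB", "ART"): 0.3,
    ("VERB", "NOUN"): 0.2, ("VERB", "ADV"): 0.1,
    ("ADJ", "NOUN"): 0.9, ("ADJ", "ADJ"): 0.1,
    ("PRON", "VERB"): 0.6, ("PRON", "NOUN"): 0.4,
}
EMIT = {
    ("ART", "the"): 0.7,
    ("NOUN", "man"): 0.05, ("VERB", "man"): 0.001,
    ("NOUN", "still"): 0.001, ("VERB", "still"): 0.001,
    ("ADJ", "still"): 0.01, ("ADV", "still"): 0.05,
    ("NOUN", "saw"): 0.002, ("VERB", "saw"): 0.02,
    ("PRON", "her"): 0.1,
}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    words = [w.lower() for w in words]
    # best[t][tag] = (log-prob of best path ending in tag at t, previous tag)
    best = [{}]
    for tag in TAGS:
        prob = START.get(tag, FLOOR) * EMIT.get((tag, words[0]), FLOOR)
        best[0][tag] = (math.log(prob), None)
    for t in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT.get((tag, words[t]), FLOOR))
            score, prev = max(
                (best[t - 1][q][0] + math.log(TRANS.get((q, tag), FLOOR)), q)
                for q in TAGS
            )
            best[t][tag] = (score + emit, prev)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda g: best[-1][g][0])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]

print(viterbi("The man still saw her".split()))
# ['ART', 'NOUN', 'ADV', 'VERB', 'PRON']
```

With these made-up numbers the transition from adverb to verb outweighs the alternatives, so "still" is tagged as an adverb and "saw" as a verb.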

Part-of-Speech and Sense Tagging

• Accuracy of 96%~99% is not uncommon in statistical POS tagging

• Word sense disambiguation (WSD) is initiated after POS tagging

• Ambiguous tokens are tagged with a sense identifier

• Consider a word w in the training text, which may be represented using a set of features

– E.g., interest:

Sense                               Usage
money paid for use of money         53%
a share in a business or company    21%
readiness to give attention         18%
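One way to use such features is a naive Bayes sense classifier that treats nearby context words as the feature set. In the sketch below the sense priors are the usage percentages listed for "interest"; the context-word likelihoods and the smoothing constant are hypothetical values chosen only to make the example run.

```python
import math

MONEY = "money paid for use of money"
SHARE = "a share in a business or company"
ATTENTION = "readiness to give attention"

# Priors from the usage percentages above (remaining senses omitted).
PRIOR = {MONEY: 0.53, SHARE: 0.21, ATTENTION: 0.18}

# Hypothetical P(context word | sense) values.
LIKELIHOOD = {
    MONEY: {"rate": 0.05, "bank": 0.04, "loan": 0.03},
    SHARE: {"stake": 0.05, "company": 0.04, "acquire": 0.03},
    ATTENTION: {"great": 0.03, "show": 0.03, "topic": 0.03},
}
UNSEEN = 1e-4  # assumed probability for context words not listed

def disambiguate(context_words):
    """Pick the sense maximizing log prior + sum of log likelihoods."""
    def score(sense):
        total = math.log(PRIOR[sense])
        for word in context_words:
            total += math.log(LIKELIHOOD[sense].get(word, UNSEEN))
        return total
    return max(PRIOR, key=score)

print(disambiguate(["bank", "rate"]))
print(disambiguate(["stake", "company"]))
```

With no context words at all, the classifier falls back to the most frequent sense.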

Parsing and Knowledge Representation

• Morphological and syntactic analyses are only the initial steps of the long path to parsing the input and then representing natural language in a form that can be manipulated and searched by a computer

Parsing and Knowledge Representation

• The sentences are quite simple, but it is nontrivial to infer that him refers to Raja in the passage

• Pronoun resolution is a special case of general resolution of references in sentences

Parsing and Knowledge Representation

• Pragmatics also plays an important role in correct parsing

Raja ate bread with jam

Raja ate bread with Ravi

• Syntactic analysis can offer clues but cannot completely resolve such ambiguity

Parsing and Knowledge Representation

• Most grammars for natural language are ambiguous

• Parsers are not always context-free, and some may backtrack over the source text

Parsing and Knowledge Representation

• Link Parser by Sleator and Temperley

• The Link Parser has a dictionary that stores terms associated with one or more linking requirements or constraints

Parsing and Knowledge Representation

• (a) A set of words from the dictionary, each with one or more linking requirements

• (b) An illegal sentence and its unsuccessful parse

• (c) A legal sentence and its successful parse

• (d) A simpler way to show a legal parse graph

• (e) A relatively complex sentence parsed by the Link Parser

Parsing and Knowledge Representation

• A successful parse introduces links among the terms in the sentence so three properties hold:

– Satisfaction:

• Each linking requirement for each term in the sentence needs to be satisfied by some connector of the opposite polarity emerging from some other word in the sentence

– Connectivity:

• The links introduced should connect all the terms in the sentence

– Planarity:

• The links introduced by the parser cannot cross when drawn above the sentence written on a line
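Two of these properties, connectivity and planarity, can be checked directly on a candidate set of links. The sketch below represents each link as a pair of word positions; this representation, the function names, and the example linkage are assumptions for illustration, not the Link Parser's own data structures or output. Satisfaction is not checked because it depends on the dictionary's linking requirements.

```python
def is_planar(links):
    """True if no two links cross when drawn above the sentence."""
    norm = [tuple(sorted(link)) for link in links]
    for i, (a, b) in enumerate(norm):
        for c, d in norm[i + 1:]:
            # Two arcs cross when exactly one endpoint of one lies
            # strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                return False
    return True

def is_connected(n_words, links):
    """True if the links join all word positions into one component."""
    if n_words == 0:
        return True
    neighbors = {i: set() for i in range(n_words)}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n_words

# Hypothetical linkage for "The cat chased a snake" (positions 0..4).
links = [(0, 1), (1, 2), (2, 4), (3, 4)]
print(is_planar(links), is_connected(5, links))
```

A linkage such as `[(0, 2), (1, 3)]` would fail the planarity check because the two arcs cross.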

Parsing and Knowledge Representation

• The parses produced by the Link Parser or some other parser can be a foundation for representing textual content in a uniform graph formalism

• Once this is accomplished, the challenge would be in matching parse graphs to query graphs and ranking the responses

• Suitably annotated parse graphs can also be used as an interlingua for translation between many languages
