Knowledge Discovery and Data Mining in Text

Data Mining and Text-based Information
Mark Wasson
Senior Architect, Research Scientist
LexisNexis
mark.wasson@lexisnexis.com
August 27, 2002
August 27, 2002
Data Mining and Text-based
Information - Mark Wasson
1
The Agenda
• Knowledge Discovery, Data Mining, Text Mining
• From Free Text to Structured Metadata
• Knowledge Discovery and Data Mining in Text
• The Forecast for Data Mining and Text
• Information Sources and Links
Knowledge Discovery, Data Mining, Text Mining
What is Knowledge Discovery?
• Knowledge discovery in databases (KDD) is
defined as “the non-trivial process of identifying
valid, novel, potentially useful, and ultimately
understandable patterns in data.”
• Stated another way, KDD is the process of
applying scaled, optimized statistical processes
to large quantities of structured data in order to
help users discover new, potentially interesting
patterns and information in that data.
What Folks Do With KDD
• Find trends and patterns in current data in order
to support predictions or classification as new
data comes in
• Explain existing data, not just describe it
• Summarize the contents in a large database to
facilitate decision making
• Support “logical” (as opposed to graphical) data
visualization to support end users
What Folks Really Do With KDD
• Business trends and financial instrument
forecasting (e.g., predict the stock market)
• Fraud detection
• Merchandise handling and placement
• Finding hidden relationships between entities
• Credit worthiness evaluation and loan approvals
• Marketing and sales data analysis
• Recommender systems
• Customer Relationship Management (CRM)
• Bioinformatics (e.g., in silico drug discovery)
• Defect identification and tracking
The 9-step KDD Process
• Understand application domain; determine goals
• Create target dataset for analysis and discovery
• Clean data for noise, missing values, etc.
• Perform data reduction
• Choose best data mining method to meet goals
• Choose best data mining algorithm for method
• Conduct data mining, i.e., apply the algorithm
• Review results (novel? interesting?); redo steps
if necessary
• Consolidate discovered knowledge
Can be fully automated, but often highly interactive
What is Data Mining? (classic def’n)
• A synonym for Knowledge Discovery
• The statistical/analytical processing within the
KDD process
What Isn’t Data Mining (classic def’n)
• Online Analytical Processing (OLAP)
• Information Retrieval
• Finding and extracting proper names and other
pieces of information in a text
• Document categorization and indexing
• Simple descriptive statistics (e.g., average, mean,
median)
These tools help find potentially interesting
existing information, but they do not discover new
information.
– Not necessarily new just because it’s new to you
What is Data Mining? (buzzword)
• With the emergence of successful data mining
applications in the mid to late-1990s, everyone
piled on to the term “data mining”
• Today “data mining” is widely used to label tools
and processes that
– Discover new, potentially interesting information
– Find existing, potentially interesting information
• “Knowledge discovery” still specifically
emphasizes discovery
What is Text Mining? (classic def’n)
• Text mining is the process of applying knowledge
discovery and data mining techniques to
information found in a collection of texts in order
to help users discover new, potentially interesting
patterns and information in that data.
• Combines information from multiple texts
– What is in an individual text is known information
• Authors know what they write
What is Text Mining? (buzzword)
• Computational linguists have piled on, too!
• Today, “text mining” is widely used to label tools
and processes that
– Discover new, potentially interesting information in text
collections
– Discover new, potentially interesting information in
text-based information
– Find existing, potentially interesting information in text
and text collections
• Information Retrieval
• Named Entity, Relationship and Information Extraction
• Categorization and Indexing
• Question Answering
Today’s Key KDD Problems
• Not enough focus on the data
– Collection
– Cleansing
– Scale
– Completeness, including non-traditional sources
– Structure
• Too much focus on algorithms
• The problem of Interestingness
– What is interesting?
– What isn’t?
– How do we tell the difference?
KDD and Text Problems
• We’re dealing with text!
– Text lacks structure that traditional data mining
processes can exploit
– Information within text is generally not labeled
– Actual and approximate synonymy
– Ambiguity
• Contrast with Spreadsheets, Databases, Etc.
– Well-defined structure
– Row, column headings identify content
How to “Fix” Text
Convert Information in Text to Metadata
From Free Text to Structured Metadata
What is Metadata?
• Metadata is data about data
• Content-based metadata is structured
information that is somehow derived from the
information content of a document rather than
from the format of a document
• Key Benefit for Data Mining: Structured
representation of content
• For our purposes references to “metadata” are
references to content-based metadata
Markup Languages and Metadata
• Standard Generalized Markup Language (SGML)
– Meta-language for defining markup languages
– Markup primarily used to support presentation
• Hypertext Markup Language (HTML)
– SGML-based markup language for the web
– Emphasis on structural elements of documents
• Extensible Markup Language (XML)
– Meta-language for defining markup languages
– Markup supports both presentation and
information/content identification
– Ability to support information/content identification is
severely limited by our ability to process text for content
Content-based Metadata
• Publisher-provided fields
– Publication name
– Title
– Author
– Date
– Dateline
– Topic-indicating terms
• A list of all the words and phrases in a document
– Simple list
– List of unique words and phrases
– Sets of related terms
– Frequency information
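The word-list metadata above can be sketched in a few lines. This is only an illustration: the function name term_metadata and the tokenization rule are assumptions, not part of any tool named in these slides, and real systems would also handle phrases and related-term sets.

```python
import re
from collections import Counter


def term_metadata(text):
    """Return simple word-level metadata for one document (hypothetical helper)."""
    # Assumed tokenization: lowercase alphabetic runs, allowing internal
    # apostrophes or hyphens. Phrase detection is out of scope here.
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    counts = Counter(tokens)
    return {
        "tokens": tokens,                # simple list, in document order
        "unique_terms": sorted(counts),  # list of unique words
        "frequencies": counts,           # frequency information
    }


doc = "Data mining finds patterns. Text mining finds patterns in text."
meta = term_metadata(doc)
print(meta["unique_terms"])
print(meta["frequencies"]["mining"])
```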
Content-based Metadata
• Specialized terms
– Named entities (companies, people, places, etc.)
– Citations, judges, attorneys, plaintiffs, defendants
– Numerical information and monetary amounts
– Noun phrases and their head nouns
– Sentences
• Relationships
– Items in close proximity
– Subject-verb-object (agent-action-patient) relationships
– Citation-based linkages
– Coreference-based linkages
(John Smith left Microsoft. He joined IBM.)
Content-based Metadata
• Content-indicating annotations
– Controlled vocabulary indexing
– Statistically interesting extracted terms
– Abstracts, summaries
– Specialized fields
– Domain templates
Value of Content-based Metadata
• Search support (information finding)
– Find and retrieve documents
– Link to related documents
• Analysis support (information understanding)
– Overall content summarization
• This has real value to information users
– Link metadata to documents via good document IDs
– Provide metadata to customers who can use it for
retrieval from their own search and analysis tools
Metadata Creation Technologies
• Publisher-provided fields
– Some basic standardization helps
• Simple term listing and counting
– Generally easy, and quite good
• Finding Specialized Terms
– Lots of good pattern recognition tools, including SRA’s
NetOwl, Inxight’s ThingFinder
– Pattern recognition, lexicons do well for most categories
(literary titles, product names are hard)
Metadata Creation Technologies
• Linguistics-based lexical tools
– Morphological analysis, part of speech tagging
– Inxight’s LinguistX
• Sentence boundary detection
– Easily doable, but may need to consider more text
• Linguistics-based syntactic tools
– Shallow parsing
– Deep parsing
– Coreference resolution
– Varied text, difficult but progressing
Metadata Creation Technologies
• Finding related items
– Proximity, within sentence easy
– Subject-verb-object/agent-action-patient requires some
degree of parsing
– Coreference-based relationship finding requires
coreference resolution
– SRA’s NetOwl
– ClearForest’s rule books
– Insightful’s InFact, SVO
– Cymfony’s Brand Dashboard
– Attensity, SVO
– Alias I, coreference-based
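A minimal sketch of the easy case, within-sentence proximity, is shown here. It does not reproduce any vendor's method: the function name, the naive sentence splitter and the fixed entity list are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurring_pairs(text, entities):
    """Count entity pairs that appear in the same sentence (hypothetical helper)."""
    pair_counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Plain substring matching stands in for real named entity recognition.
        present = sorted(e for e in entities if e in sentence)
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


text = ("John Smith left Microsoft. He joined IBM. "
        "Microsoft and IBM declined to comment.")
pairs = cooccurring_pairs(text, ["John Smith", "Microsoft", "IBM"])
print(pairs)
```

With this input the John Smith and IBM pair is never counted, because the second sentence refers to him only as "He". Recovering that link is what coreference resolution adds.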
Metadata Creation Technologies
• Template-driven extraction
– Often combines many technologies into domain-specific
applications
– ClearForest’s rule books
– WhizBang (defunct, now Inxight?) machine
learning-based extraction
– Various “web-farming” technologies, e.g., Caesius
– University of Sheffield’s GATE tool kit
• Automatic abstracting/summarization
– Leading text best for individual news documents
– Columbia University’s NewsBlaster for multiple texts
– True summary generation – a hard problem
Metadata Creation Technologies
• Document categorization and indexing
– 80% - 90% accurate (recall and precision) common
– Often integrated with editorial processes
– Inxight
– Nstein
– Stratify
– Verity
– A lot of others
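For readers unfamiliar with the two accuracy measures quoted above, here is a small sketch of how precision and recall are computed for one category. The function name and the document IDs are made up for illustration.

```python
def precision_recall(assigned, correct):
    """Precision and recall for one category.

    assigned: document IDs the categorizer labeled with the category
    correct:  document IDs that truly belong to the category
    """
    assigned, correct = set(assigned), set(correct)
    true_positives = len(assigned & correct)
    # Precision: what fraction of the assigned labels were right?
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: what fraction of the right documents were found?
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall


# Illustrative IDs only: 4 of 5 assignments are right, 4 of 5 targets found.
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5"],
                        ["d1", "d2", "d3", "d4", "d6"])
print(p, r)
```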
Metadata Creation Technologies
• Metadata creation technologies
– Text mining?
• Read about them
– Natural Language Processing for Online Applications –
Text Retrieval, Extraction and Categorization (John
Benjamins Publishing Company, 2002)
Peter Jackson, Vice President of R&D, and
Isabelle Moulinier, Senior Research Scientist,
Thomson Legal & Regulatory
Knowledge Discovery and Data Mining in Text
Combining KDD and Metadata
• What is Knowledge Discovery in Metadata?
(The term is unique to us, by the way; Ronen Feldman et al.
called this Knowledge Discovery in Text)
• It is KDD that incorporates document metadata
into its data collection step
Basic KDD Task Using Metadata
• Data source selection
• Metadata creation, organization
• Perhaps combine with other appropriate data
– Align data based on common attributes
– Align data based on date or time
– Use knowledge sources to guide analysis of metadata
(e.g., world knowledge, thesauri, etc.)
• Analyze the data
– Language-aware processes, e.g., SVO
– Routine processes that apply to structured content
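The "align data based on date" step can be sketched as follows. This is an assumption-laden illustration: the function name, the choice of ISO date strings as keys, and the sample values are all invented, and a real task would pull both series from actual sources.

```python
def align_by_date(metadata_counts, other_series):
    """Join two date-keyed series on the dates they share (hypothetical helper).

    Both arguments map ISO date strings (YYYY-MM-DD) to a value, so sorting
    the keys as strings also sorts them chronologically.
    """
    common_dates = sorted(metadata_counts.keys() & other_series.keys())
    return [(d, metadata_counts[d], other_series[d]) for d in common_dates]


# Made-up values: daily index-term counts for a company, and some other
# daily measurement for the same company.
term_counts = {"2002-08-26": 4, "2002-08-27": 19, "2002-08-28": 7}
other_data = {"2002-08-27": 31.5, "2002-08-28": 29.8, "2002-08-29": 30.1}

for row in align_by_date(term_counts, other_data):
    print(row)
```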
Research Problems
• Does document metadata have value for KDD
applications in addition to its value for
information finding and retrieval purposes?
• If so, where?
Example 1 – Trend Analysis
• Research at LexisNexis
• Can daily “hot topics” be identified automatically
by comparing today’s indexing frequency for the
topic to its recent history?
– Track controlled vocabulary indexing assignments over
time to determine a historical average
– Compare today’s frequency of assignment for a given
company’s index term to its historical average
– If it exceeds some threshold, flag it as a “hot” company
in that day’s news
– Analysts confirmed that 96.2% of 1,137 flagged companies
and company pairs were in fact “hot”
See Shewhart & Wasson (1999)
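The thresholding idea above can be sketched roughly as below. The slides do not give the actual threshold or averaging window, so the ratio and min_count parameters, the function name and the sample counts are assumptions, not the published method.

```python
from statistics import mean


def is_hot(history, today, ratio=3.0, min_count=5):
    """Flag a company as "hot" when today's count far exceeds its average.

    history:   recent daily counts of the company's index term
    today:     today's count of the same index term
    ratio:     assumed multiple of the historical average to exceed
    min_count: assumed floor so rarely indexed companies are not flagged
    """
    baseline = mean(history) if history else 0.0
    return today >= min_count and today > ratio * baseline


print(is_hot([2, 3, 1, 2, 2], today=12))   # average 2, today 12
print(is_hot([10, 12, 11], today=15))      # average 11, today 15
```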
Example 2 – Emerging Technologies
• Research at IBM
• Can trends in emerging and fading technologies
be identified?
– Extract, normalize and monitor vocabulary found in
documents and compare it to document categories
– Provide users with a querying tool where they can
specify the “shape” of the trend
– Used patent data
See Lent et al. (1997)
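One very simplified way to picture a trend "shape" query is to reduce a term's frequency series to a sequence of up, down and flat steps and compare it with the shape the user asked for. This is a sketch under that assumption only; it is not the query language described by Lent et al., and the function names and counts are invented.

```python
def trend_steps(series):
    """Describe a series as a list of "up", "down" or "flat" steps."""
    return [
        "up" if later > earlier else "down" if later < earlier else "flat"
        for earlier, later in zip(series, series[1:])
    ]


def matches_shape(series, shape):
    """True when the series moves exactly as the requested shape says."""
    return trend_steps(series) == list(shape)


# Made-up yearly counts of a phrase in patent documents.
counts = [1, 3, 5, 4]
print(trend_steps(counts))
print(matches_shape(counts, ["up", "up", "down"]))
```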
Example 3 - Influence of News Stories
• Work at University of Massachusetts
• Can specific news stories be identified that will
influence the behavior in financial markets?
– Examine features of news articles that occurred before
interesting changes in the financial markets
– Find patterns of features that regularly occur before
interesting changes
– In future data, monitor incoming stories for those
patterns for alert purposes
– Real-time data, real-time stock prices
See Lavrenko et al. (2000)
Example 4 - Citation Pattern Analysis
• Can citation histories be used to identify potential
relationships between specific illnesses and
other features, exposures, medications, etc.?
– Collect the citations in a large collection of medical texts
– Examine citation chains in pairs of domains that do not
directly cite one another
– Measure the amount of overlap in the citation chain
– Verify results through clinical medical research
See Swanson & Smalheiser (1996)
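The "amount of overlap" step could be measured in many ways. The sketch below uses Jaccard overlap between the sets of works cited by two literatures, which is a swapped-in measure chosen for illustration and not necessarily the one used by Swanson & Smalheiser. The function name and reference labels are made up.

```python
def citation_overlap(cites_a, cites_c):
    """Jaccard overlap between the works cited by two literatures.

    Returns a value from 0.0 (nothing shared) to 1.0 (identical citations).
    """
    a, c = set(cites_a), set(cites_c)
    union = a | c
    return len(a & c) / len(union) if union else 0.0


# Made-up labels: two domains that never cite each other directly
# but share two intermediate references.
domain_a = ["ref1", "ref2", "ref3"]
domain_c = ["ref2", "ref3", "ref4"]
print(citation_overlap(domain_a, domain_c))
```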
Example 5 - Sentiment Detection
• Work at Webmind (out of business)
• Is the tone of news stories, Usenet discussions,
website stories, etc., about some company, its
management or its products positive or negative?
– Use categorization technology to determine the positive
or negative tone in individual documents about a given
company or its products
– Combine results across all documents about that
company or its products
– Compute a score or summarize the results
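The combine-and-score steps might look like the following sketch. It assumes a categorizer has already labeled each document "positive", "negative" or "neutral"; the label names, the function name and the scoring formula are assumptions, not Webmind's method.

```python
from collections import Counter


def tone_summary(labels):
    """Combine per-document tone labels into one summary for a company.

    The score runs from -1.0 (all negative) to +1.0 (all positive) and
    ignores neutral documents.
    """
    counts = Counter(labels)
    positive, negative = counts["positive"], counts["negative"]
    scored = positive + negative
    score = (positive - negative) / scored if scored else 0.0
    return {"positive": positive, "negative": negative, "score": score}


# Made-up labels for four documents about one company.
print(tone_summary(["positive", "positive", "negative", "neutral"]))
```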
Example 6 - Link Genes to Diseases
• Work at Hewlett Packard Laboratories
• Can sets of genes be associated with given
diseases by analyzing MEDLINE abstracts?
– Identify references to genes, addressing major problems
with recognition, ambiguity and synonymy in this
domain
– Identify references to targeted diseases
– Statistically analyze co-occurrence patterns between
mentions of the genes and mentions of diseases for
statistically significant correlations
See Adamic et al. (2002)
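As an illustration of the co-occurrence analysis step, the sketch below applies a standard chi-square test to a 2x2 table of abstract counts. This is a stand-in technique, not necessarily the statistic used by Adamic et al., and the function name and counts are invented.

```python
def chi_square_2x2(both, gene_only, disease_only, neither):
    """Pearson chi-square statistic for a 2x2 co-occurrence table.

    both:         abstracts mentioning the gene and the disease
    gene_only:    abstracts mentioning only the gene
    disease_only: abstracts mentioning only the disease
    neither:      abstracts mentioning neither
    """
    n = both + gene_only + disease_only + neither
    row1, row2 = both + gene_only, disease_only + neither
    col1, col2 = both + disease_only, gene_only + neither
    denominator = row1 * row2 * col1 * col2
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association is testable
    return n * (both * neither - gene_only * disease_only) ** 2 / denominator


# Made-up counts over 10,000 abstracts.
statistic = chi_square_2x2(both=30, gene_only=70, disease_only=120, neither=9780)
# 3.841 is the 5% critical value for one degree of freedom.
print(statistic, statistic > 3.841)
```

A high statistic only says the two mentions are not independent; it does not show which way the association runs or that it is biologically meaningful.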
Additional Examples
• Analyzing the activities of a person, company or
organization using its role as subject/agent or
object/patient in clauses
• Predicting the spread between borrowing and
lending interest rates
• Identifying technical traders in the T-bonds
futures market
• Daily predictions of major stock indexes
Data Mining and Text Vendors
• Alias I
• Attensity
• ClearForest
• eNeuralNet
• IBM (Intelligent Miner for Text)
• Inforsense
• Insightful (InFact)
• Megaputer Intelligence
• SAS (Enterprise Miner, Inxight)
• SPSS (LexiQuest)
The Forecast for Data Mining and Text
What is the forecast for KDT?
• Can we get information from unstructured (free)
text into some structured format?
• Are there enough interesting KDD applications
where access to content-based metadata from
text actually produces interesting results?
• Does adding text-based information to existing
data mining and knowledge discovery
applications make them better?
KDT, 1996-1999
• A handful of interesting experiments published
– Mostly one-off experiments
– Almost no evidence any of it was commercialized
• Holding back the research
– Almost no one had access to large quantities of
appropriate metadata for research purposes
– Linguistics technologies still maturing, often too slow
– Almost no one had the combination of content and tools
to generate large quantities of appropriate metadata for
research purposes
KDT, 2000+
• Movement. Early stages, but movement
• Maturing, scalable tools in classification and
extraction from web content and other texts to
create metadata
• Products from the Big 3 analytical tool providers
(SAS, SPSS, Insightful)
• Companies created to focus on it (not always
successful), such as ClearForest, Webmind
• Emerging importance of bioinformatics,
availability of MEDLINE content
• But data mining hit hard by dot-com collapse
The Forecast
• KDT is emerging, but slowly
• Still in early stages
• Lots of promise
Information Sources and Links
Resources
• KDnuggets, http://www.kdnuggets.com
• ACM Special Interest Group in Knowledge
Discovery and Data Mining,
http://www.acm.org/sigkdd/
• Association for Computational Linguistics,
http://www.aclweb.org
• Data Mining and Knowledge Discovery (journal),
Kluwer Academic Publishers,
http://www.digimine.com/usama/datamine/
• Companies,
http://www.kdnuggets.com/companies/
• Glossary of Terms,
http://www3.shore.net/~kht/glossary.htm
Related Technical Conferences
• The 3rd SIAM International Conference on Data
Mining, May 1-3, 2003, San Francisco, CA
http://www.siam.org/meetings/sdm03/
• 2003 North American Chapter of the Association for
Computational Linguistics/Human Language
Technology Joint Conference, approx. early June,
2003, Edmonton, AB
http://www.aclweb.org
• The 9th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, August
24-27, 2003, Washington, DC
http://www.acm.org/sigkdd/kdd2003/
Books
• Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., &
Uthurusamy, R. (1996). Advances in Knowledge
Discovery and Data Mining. AAAI Press / The MIT
Press.
• Jackson, P., & Moulinier, I. (2002). Natural
Language Processing for Online Applications –
Text Retrieval, Extraction and Categorization.
John Benjamins Publishing Company.
Company Links
Attensity, http://www.attensity.com
Alias I, http://www.alias-i.com
Caesius, http://www.caesius.com
ClearForest, http://www.clearforest.com
Columbia University,
http://www.cs.columbia.edu/nlp/newsblaster/
Cymfony, http://www.cymfony.com
eNeuralNet, http://www.eneuralnet.com
Hewlett Packard Labs,
http://www.hpl.hp.com/org/stl/dmsd/
IBM, http://www-3.ibm.com/software/data/iminer/
Company Links
Inforsense, http://www.inforsense.com
Insightful, http://www.insightful.com
Inxight, http://www.inxight.com
John Benjamins Publishing,
http://www.benjamins.com/cgibin/t_bookview.cgi?bookid=NLP_5
Megaputer Intelligence, http://www.megaputer.com
Nstein, http://www.nstein.com
SAS, http://www.sas.com
SPSS, http://www.spss.com
SRA International, http://www.sra.com
Company Links
Stratify, http://www.stratify.com
University of Massachusetts-Amherst,
http://ciir.cs.umass.edu/
University of Sheffield, http://gate.ac.uk/
Verity, http://www.verity.com
Data Mining/Text References
Adamic, L., Wilkinson, D., Huberman, B., & Adar, E. (2002). A
Literature Based Method for Identifying Gene-Disease
Connections. Proceedings of the 1st IEEE Computer Society
Bioinformatics Conference.
Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D., & Allan, J.
(2000). Language Models for Financial News Recommendation.
Proceedings of the 9th International Conference on Information
and Knowledge Management.
Lent, B., Agrawal, R., & Srikant, R. (1997). Discovering Trends in Text
Databases. Proceedings of the 3rd International Conference on
Knowledge Discovery and Data Mining.
Shewhart, M., & Wasson, M. (1999). Monitoring Newsfeeds for “Hot
Topics.” Proceedings of the 5th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining.
Swanson, D., & Smalheiser, N. (1996). Undiscovered Public
Knowledge: A Ten-year Update. Proceedings of the 2nd
International Conference on Knowledge Discovery and Data
Mining.
Questions?
You can also contact me at
mark.wasson@lexisnexis.com