
Introduction to Biomedical Informatics
Text Mining
© Hayes/Smyth: Introduction to Biomedical Informatics: 1
Outline
• Introduction and Motivation
• Techniques
– Document classification
– Document clustering
– Topic discovery from text
– Information extraction
• Additional Resources and Recommended Reading
© Hayes/Smyth: Introduction to Biomedical Informatics: 2
Motivation for Text Mining
• PubMed
– PubMed adds approximately 1 million new articles per year
– Human annotation cannot keep up – increased demand for automation
– Problem is even greater in other domains (e.g., Web search in general)
© Hayes/Smyth: Introduction to Biomedical Informatics: 3
From Jensen, Saric, Bork,
Nature Reviews Genetics, 2006
© Hayes/Smyth: Introduction to Biomedical Informatics: 4
Text Mining Problems
• Classification: automatically assign a document into 1 or more categories
– “easy” problem: is an email spam or non-spam?
– “hard” problem: assign MeSH terms to new PubMed articles
• Clustering: Group a set of documents into clusters
– e.g., automatically group docs in search results by theme
• Topic Discovery: Discover themes in docs and index docs by theme
– E.g., discover new scientific concepts not yet covered by MeSH terms
• Information Extraction: Extract mentions of entities from documents
– “easy” problem: extract all gene names mentioned in a document
– “hard” problem: extract a set of facts relating genes in a document, e.g.,
statements such as “gene A activates gene B”
© Hayes/Smyth: Introduction to Biomedical Informatics: 5
Classification and Clustering
• We already discussed these methods in the context of general data
mining in earlier lectures. Now we want to apply these techniques
specifically to text data
• Recall:
– Given a vector of features x, a classifier maps x to a target variable y,
where y is categorical, e.g., y = {has cancer, does not have cancer}
– Learning a classifier consists of being given a training data set of pairs of
x’s and y’s, and learning how to map x to y
– Clustering is similar, but the training data doesn’t have any target y values
– we have to discover the target values (the “clusters”) automatically
© Hayes/Smyth: Introduction to Biomedical Informatics: 6
Classification and Clustering for Text Documents
• Document representation
– Most classification and clustering algorithms assume that each object
(here a document) to be classified can be represented as a fixed-length
vector of variable/feature/attribute values
– So how do we convert documents into fixed-length vectors?
– “Bag of Words” representation
• Each vector entry represents whether term j occurred in a document, or how
often it occurred
• Same idea as for information retrieval
• Ignores word order, document structure….but found to work well in practice
and has considerable computational advantages over working with the
document directly
Once we have our vector (bag-of-words) representation we can use any
classification or clustering method on our documents (a minimal vectorization
sketch follows below)
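As a minimal sketch (plain Python, whitespace tokenization only; the function name is illustrative), converting raw documents into fixed-length count vectors over a shared vocabulary looks like this:

```python
from collections import Counter

def bag_of_words(documents, vocabulary=None):
    """Turn raw documents into fixed-length term-count vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    if vocabulary is None:
        vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts.get(term, 0) for term in vocabulary])
    return vocabulary, vectors

docs = ["gene A activates gene B", "the protein binds gene B"]
vocab, X = bag_of_words(docs)
print(vocab)   # one shared, fixed vocabulary
print(X)       # one fixed-length count vector per document
```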
© Hayes/Smyth: Introduction to Biomedical Informatics: 7
Document Classification
© Hayes/Smyth: Introduction to Biomedical Informatics: 8
Document Classification
• Document classification has many applications
  – Spam email detection
  – Automated tagging of biomedical articles (e.g., in PubMed)
  – Automated creation of Web-page taxonomies
• Data Representation
  – “Bag of words” most commonly used: either counts or binary
  – Can also use “phrases” for commonly occurring combinations of words
• Classification Methods
  – Naïve Bayes widely used (e.g., for spam email)
    • Fast and reasonably accurate
  – Support vector machines (SVMs)
    • Often the most accurate method in research studies
    • But more complex computationally than other methods
  – Logistic Regression (regularized)
    • Widely used in industry, often competitive with SVMs
© Hayes/Smyth: Introduction to Biomedical Informatics: 9
Trimming the Vocabulary
• Stopword removal:
  – remove “non-content” words
    • very frequent “stop words” such as “the”, “and”, ...
  – remove very rare words, e.g., words that occur only a few times in 1 million documents
  – Often results in removal of 30% or more of the original unique words
• Stemming:
  – Reduce all variants of a word to a single term
  – e.g., {draw, drawing, drawings} -> “draw”
  – Can use the Porter stemming algorithm (1980)
• This still often leaves us with 10,000 to 1 million unique terms
  => a very high-dimensional classification problem!
(A minimal trimming sketch follows below.)
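A minimal trimming sketch, assuming NLTK is installed and its English stopword list can be downloaded; the helper name and the optional rare-word filter are illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)   # one-time download of the stopword list

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def trim(tokens, counts=None, min_count=2):
    """Drop stop words, stem the rest; optionally drop very rare stemmed terms."""
    kept = [STEMMER.stem(t.lower()) for t in tokens if t.lower() not in STOP]
    if counts is not None:                       # corpus-wide term counts, if available
        kept = [t for t in kept if counts.get(t, 0) >= min_count]
    return kept

print(trim("The drawings and the drawing were drawn".split()))
# ['draw', 'draw', 'drawn']  (Porter stems 'drawing(s)' to 'draw')
```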
© Hayes/Smyth: Introduction to Biomedical Informatics: 10
Classifying Term Vectors
• Typically multiple different terms or words may be helpful
– Class = “finance”
– Words = “stocks”, “return”, “interest”, “rate”, etc.
– Thus, classifiers that combine multiple features often do well, e.g.,
• Naïve Bayes, Logistic regression, SVMs, etc
(compared to decision trees, for example, which would branch on 1 word at a time)
• Linear classifiers often perform well in high-dimensions
– Typically we have a large number of features/dimensions in text classification
– Theory and experiments tell us linear classifiers do well in high dimensions
– So naïve Bayes, logistic regression, and linear SVMs are all useful
• Main questions in practice are:
– which terms to use in the classifier?
– which linear classifier to select?
© Hayes/Smyth: Introduction to Biomedical Informatics: 11
Feature Selection
• Performance of text classification algorithms can be optimized by
selecting only a subset of the discriminative terms
– See classification results later in these slides
• Greedy search
– Start from empty set or full set and add/delete one at a time
– Heuristics for adding/deleting
• Information gain (mutual information of term with class)
• Chi-square
• Other ideas
– Methods tend not to be particularly sensitive to the specific heuristic
used for feature selection, but some form of feature selection often
improves performance (a small information-gain sketch follows below)
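A minimal sketch of the information-gain (mutual information) heuristic for a single binary term feature (plain Python; the function name is illustrative). Terms are ranked by this score and only the top-scoring ones are kept as features:

```python
import math
from collections import Counter

def information_gain(term_present, labels):
    """Mutual information between a binary term indicator and the class label.

    term_present: list of 0/1 flags (does the term occur in document i?)
    labels:       list of class labels, one per document
    """
    n = len(labels)
    joint = Counter(zip(term_present, labels))   # counts of (term flag, class) pairs
    p_t = Counter(term_present)
    p_c = Counter(labels)
    mi = 0.0
    for (t, c), n_tc in joint.items():
        p_tc = n_tc / n
        mi += p_tc * math.log(p_tc / ((p_t[t] / n) * (p_c[c] / n)))
    return mi

# Rank all candidate terms by information_gain(...) and keep the top k.
```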
© Hayes/Smyth: Introduction to Biomedical Informatics: 12
Example of Role of Feature Selection
(from Chakrabarti, Chapter 5)
9600 documents from US Patent database
20,000 raw features (terms)
© Hayes/Smyth: Introduction to Biomedical Informatics: 13
Types of Classifiers
Let c be the class label and let x be a vector of features
• Generative/Probabilistic
– Model p(x | c) for each class, then estimate p(c | x)
– e.g., naïve Bayes model
• Conditional Probability/Regression
– Model p(c | x) directly
– e.g., logistic regression
• Discriminative
– Look for decision boundaries in input space x directly
• No probabilities
– e.g., perceptron, linear discriminants, SVMs, etc
© Hayes/Smyth: Introduction to Biomedical Informatics: 14
Probabilistic “Generative” Classifiers
• Model p( x | ck ) for each class and perform classification via Bayes rule,
c = arg max { p( ck | x ) } = arg max { p( x | ck ) p(ck) }
• How to model p( x | ck )?
– p( x | ck ) = probability of a “bag of words” x given a class ck
• Two commonly used approaches (for text):
– Naïve Bayes: treat each term xj as being conditionally independent, given ck
– Multinomial: model a document with N words as N tosses of a p-sided die
© Hayes/Smyth: Introduction to Biomedical Informatics: 15
Naïve Bayes Classifier for Text
• Naïve Bayes classifier = conditional independence model
– Assumes conditional independence of the terms given the class:
  p( x | ck ) = ∏j p( xj | ck )
– Note that we model each term xj as a discrete random variable
– Binary terms (Bernoulli):
  p( x | ck ) = ∏j: xj=1 p( xj = 1 | ck ) · ∏j: xj=0 p( xj = 0 | ck )
(A scoring sketch in log form follows below.)
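To make the computation concrete, here is a minimal sketch (plain Python; the dictionaries and function name are illustrative) of scoring one binary bag-of-words vector under the Bernoulli conditional-independence model, working in log space:

```python
import math

def bernoulli_log_likelihood(x, p_term_given_class):
    """log p(x | c) under the conditional-independence (naive Bayes) assumption.

    x:                  dict term -> 0/1 (binary bag-of-words vector)
    p_term_given_class: dict term -> estimated p(term present | class c)
    """
    log_p = 0.0
    for term, p in p_term_given_class.items():
        log_p += math.log(p) if x.get(term, 0) else math.log(1.0 - p)
    return log_p

# A document is assigned to the class maximizing log p(x | c) + log p(c).
```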
© Hayes/Smyth: Introduction to Biomedical Informatics: 16
Multinomial Classifier for Text
• Multinomial Classification model
Assume that the data are generated by a p-sided die (multinomial model):

  p( x | ck ) ∝ ∏j=1..N p( xj | ck )^nj

where N = total number of terms
      nj = number of times term j occurs in the document

Here we have a single random variable for each class, and the p( xj = i | ck )
probabilities sum to 1 over i (i.e., a multinomial model).
(A small sketch contrasting the Bernoulli and multinomial models follows below.)
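As a minimal sketch of the two models in practice (assuming scikit-learn is available; the tiny corpus and labels are illustrative): BernoulliNB uses binary occurrence indicators, while MultinomialNB uses the term counts nj directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

train_docs = ["gene a activates gene b", "stock prices fell sharply",
              "protein binds the receptor", "interest rates rose again"]
train_labels = ["bio", "finance", "bio", "finance"]

counts = CountVectorizer().fit(train_docs)
X = counts.transform(train_docs)                     # term-count vectors

bernoulli = BernoulliNB().fit(X, train_labels)       # binarizes the counts internally
multinomial = MultinomialNB().fit(X, train_labels)   # uses the counts nj directly

test = counts.transform(["the gene regulates a protein"])
print(bernoulli.predict(test), multinomial.predict(test))
```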
© Hayes/Smyth: Introduction to Biomedical Informatics: 17
Highest Probability Terms in Multinomial Distributions
© Hayes/Smyth: Introduction to Biomedical Informatics: 18
Common Data Sets used for Evaluation
• Reuters
  – 10,700 labeled documents
  – 10% of documents have multiple class labels
• Yahoo! Science Hierarchy
  – 95 disjoint classes with 13,598 pages
• 20 Newsgroups data
  – 18,800 labeled USENET postings
  – 20 leaf classes, 5 root-level classes
• WebKB
  – 8,300 documents in 7 categories such as “faculty”, “course”, “student”
• Industry
  – 6,449 home pages of companies partitioned into 71 classes
© Hayes/Smyth: Introduction to Biomedical Informatics: 19
Comparing Naïve Bayes and Multinomial models
McCallum and Nigam (1998)
Found that multinomial outperformed naïve Bayes (with binary features) in text
classification experiments
© Hayes/Smyth: Introduction to Biomedical Informatics: 20
Comparing Multinomial and Bernoulli on Reuters Data
(from McCallum and Nigam, 1998)
© Hayes/Smyth: Introduction to Biomedical Informatics: 21
Comparing Bernoulli and Multinomial
(slide from Chris Manning, Stanford)
Results from classifying 13,589 Yahoo! Web pages in Science subtree
of hierarchy into 95 different classes
© Hayes/Smyth: Introduction to Biomedical Informatics: 22
WebKB Data Set
• Train on ~5,000 hand-labeled web pages
– Cornell, Washington, U.Texas, Wisconsin
• Crawl and classify a new site (CMU)
• Results:
Category     Extracted   Correct   Accuracy
Student      180         130       72%
Faculty      66          28        42%
Person       246         194       79%
Project      99          72        73%
Course       28          25        89%
Department   1           1         100%
© Hayes/Smyth: Introduction to Biomedical Informatics: 23
Comparing Bernoulli and Multinomial on Web KB Data
© Hayes/Smyth: Introduction to Biomedical Informatics: 24
Comments on Generative Models for Text
(Comments applicable to both Naïve Bayes and Multinomial classifiers)
• Simple and fast => popular in practice
  – e.g., linear in p, n, M for both training and prediction
  – Training = “smoothed” frequency counts, e.g.,
    p( xj = 1 | ck ) = ( nk,j + α ) / ( nk + α m )
  – e.g., easy to use in situations where the classifier needs to be updated regularly
    (e.g., for spam email)
• Numerical issues
  – Typically work with log p( ck | x ), etc., to avoid numerical underflow
  – Useful trick for sparse data when computing Σj log p( xj | ck ):
    • precompute Σj log p( xj = 0 | ck ) over the whole vocabulary, once per class
    • then, for each document, subtract off log p( xj = 0 | ck ) and add log p( xj = 1 | ck )
      only for the terms that actually occur (see the sketch below)
• Note: both models are “wrong”, but for classification they are often sufficient
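A minimal sketch of the sparse-data trick above (plain Python; the dictionary layout and function names are illustrative):

```python
import math

def precompute_absent_sum(p_present):
    """Sum of log p(xj = 0 | ck) over the whole vocabulary, computed once per class."""
    return sum(math.log(1.0 - p) for p in p_present.values())

def bernoulli_log_score(present_terms, p_present, absent_sum):
    """log p(x | ck) for a sparse document: start from the all-absent sum and
    correct only the terms that actually occur in the document."""
    score = absent_sum
    for term in present_terms:
        p = p_present[term]
        score += math.log(p) - math.log(1.0 - p)
    return score
```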
© Hayes/Smyth: Introduction to Biomedical Informatics: 25
optional
Beyond independence
• Naïve Bayes and multinomial both assume conditional independence of words
  given the class
• Alternative approaches try to account for higher-order dependencies
  – Bayesian networks:
    • p( x | c ) = ∏j p( xj | parents(xj), c )
    • Equivalent to a directed graph where edges represent direct dependencies
    • Various algorithms search for a good network structure
    • Useful for improving the quality of the distribution model
    • ...however, this does not always translate into better classification
  – Maximum entropy models:
    • p( x | c ) = 1/Z ∏subsets f( subset(x) | c )
    • Equivalent to an undirected graph model
    • Estimation is equivalent to the maximum entropy assumption
    • Feature selection is crucial (which f terms to include) –
      can provide high-accuracy classification
    • ...however, tends to be computationally complex to fit (estimating Z is difficult)
© Hayes/Smyth: Introduction to Biomedical Informatics: 26
Basic Concepts of Support Vector Machines
Circles = support vectors = the points (on the convex hull of each class) that are
closest to the separating hyperplane
M = margin = distance of the support vectors from the hyperplane
Goal: find the weight vector that maximizes M
© Hayes/Smyth: Introduction to Biomedical Informatics: 27
Reuters Data Set
• 21,578 documents, labeled manually
• 9,603 training and 3,299 test articles
• 118 categories
  – An article can be in more than one category
  – Learn 118 binary category distinctions
• Example “interest rate” article:
  2-APR-1987 06:35:19.50
  west-germany
  b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2
  The Bundesbank left credit policies unchanged after today's regular meeting of its
  council, a spokesman said in answer to enquiries. The West German discount rate
  remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
• Common categories (#train, #test):
  – Earn (2877, 1087)
  – Acquisitions (1650, 179)
  – Money-fx (538, 179)
  – Grain (433, 149)
  – Crude (389, 189)
  – Trade (369, 119)
  – Interest (347, 131)
  – Ship (197, 89)
  – Wheat (212, 71)
  – Corn (182, 56)
© Hayes/Smyth: Introduction to Biomedical Informatics: 28
Dumais et al. 1998: Reuters - Accuracy
Category      NBayes    Trees     LinearSVM
earn          95.9%     97.8%     98.2%
acq           87.8%     89.7%     92.8%
money-fx      56.6%     66.2%     74.0%
grain         78.8%     85.0%     92.4%
crude         79.5%     85.0%     88.3%
trade         63.9%     72.5%     73.5%
interest      64.9%     67.1%     76.3%
ship          85.4%     74.2%     78.0%
wheat         69.7%     92.5%     89.7%
corn          65.3%     91.8%     91.1%
Avg Top 10    81.5%     88.4%     91.4%
Avg All Cat   75.2%     na        86.4%
© Hayes/Smyth: Introduction to Biomedical Informatics: 29
Precision-Recall for SVM (linear),
Naïve Bayes, and NN (from Dumais 1998)
using the Reuters data set
© Hayes/Smyth: Introduction to Biomedical Informatics: 30
Comparing Text Classifiers
• Naïve Bayes or Multinomial models
– Low time complexity (training = single linear pass through the data)
– Generally good, but not always best performance
– Widely used for spam email filtering
• Linear SVMs, Logistic Regression
– Often produce best results in research studies
– But more computationally complex to train (particularly SVMs)
• Others
  – decision trees: less widely used, but can be useful
(A minimal training/evaluation comparison sketch follows below.)
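As a rough illustration of the trade-offs above, here is a minimal sketch (assuming scikit-learn, which ships a loader for the 20 Newsgroups data mentioned earlier; parameters are not tuned) that trains and scores the three linear models on the same TF-IDF features:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")   # downloads the corpus on first use
test = fetch_20newsgroups(subset="test")

for name, clf in [("multinomial naive Bayes", MultinomialNB()),
                  ("logistic regression", LogisticRegression(max_iter=1000)),
                  ("linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    model.fit(train.data, train.target)
    print(name, model.score(test.data, test.target))   # test-set accuracy
```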
© Hayes/Smyth: Introduction to Biomedical Informatics: 31
optional
Learning with Labeled and Unlabeled documents
• In practice, obtaining labels for documents is time-consuming, expensive,
and error prone
– Typical application: small number of labeled docs and a very large number of
unlabeled docs
• Idea:
– Build a probabilistic model on labeled docs
– Classify the unlabeled docs, get p(class | doc) for each class and doc
• This is equivalent to the E-step in the EM algorithm
– Now relearn the probabilistic model using the new “soft labels”
• This is equivalent to the M-step in the EM algorithm
– Continue to iterate until convergence (e.g., class probabilities do not change
significantly)
– This EM approach shows that unlabeled data can improve classification
performance compared to using labeled data alone (see the sketch below)
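A minimal sketch of this EM loop, assuming scikit-learn; X_lab and X_unlab are sparse count matrices (e.g., from CountVectorizer), and replicating the unlabeled documents once per class with sample_weight is just one convenient way to hand the soft labels to MultinomialNB:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def em_semi_supervised(X_lab, y_lab, X_unlab, n_iter=20):
    """EM for text classification with labeled + unlabeled documents."""
    classes = np.unique(y_lab)                       # same order as clf.classes_
    clf = MultinomialNB().fit(X_lab, y_lab)          # initial model: labeled docs only
    for _ in range(n_iter):
        soft = clf.predict_proba(X_unlab)            # E-step: p(class | doc)
        # M-step: refit, copying the unlabeled docs once per class and weighting
        # each copy by its soft class probability.
        X_all = vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate(
            [y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate(
            [np.ones(X_lab.shape[0])] + [soft[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```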
© Hayes/Smyth: Introduction to Biomedical Informatics: 32
Learning with Labeled and Unlabeled Data
(from “Semi-supervised text classification using EM”, Nigam, McCallum, and Mitchell, 2006)
© Hayes/Smyth: Introduction to Biomedical Informatics: 33
Other issues in text classification
• Real-time constraints:
– Being able to update classifiers as new data arrives
– Being able to make predictions very quickly in real-time
• Document length
– Varying document length can be a problem for some classifiers
– Multinomial tends to be better than Bernoulli for example
• Multi-labels and multiple classes
– Text documents can have more than one label
– SVMs, for example, natively handle only two classes at a time (multi-class and
multi-label problems require schemes such as one-vs-rest)
© Hayes/Smyth: Introduction to Biomedical Informatics: 34
Other issues in text classification (continued)
• Feature selection
– Experiments have shown that feature selection (e.g., by greedy
algorithms using information gain) can often improve results
• Linked documents
– Can view Web documents as nodes in a directed graph
– Classification can now be performed that leverages the link structure,
• Heuristic = class labels of linked pages are more likely to be the same
– Optimal solution is to classify all documents jointly rather than
individually
– Resulting “global classification” problem is typically computationally
complex
© Hayes/Smyth: Introduction to Biomedical Informatics: 35
Background Resources: Document Classification
• S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data,
  Morgan Kaufmann, 2003.
  – See chapter 5 for a discussion of text classification
• C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval,
  Cambridge University Press, 2008.
  – Chapters 13 to 15 on text classification
  – (and chapters 16 and 17 on text clustering)
  – http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• SVMs for text classification
  – T. Joachims, Learning to Classify Text using Support Vector Machines: Methods,
    Theory and Algorithms, Kluwer, 2002.
© Hayes/Smyth: Introduction to Biomedical Informatics: 36
Document Clustering
© Hayes/Smyth: Introduction to Biomedical Informatics: 37
Document Clustering
• For clustering we can use either
  – Vectors to represent each document (e.g., bag of words)
    • Useful for clustering algorithms such as k-means or probabilistic clustering
  – or, for N documents, an N x N similarity matrix
    • Doc-doc similarity can be defined in different ways (e.g., TF-IDF cosine)
    • Useful for clustering methods such as hierarchical clustering
• Unlike classification, there is typically less concern with selecting the
  “best” vocabulary for clustering
  – remove common stop words and infrequent words
(A minimal sketch of both routes follows below.)
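A minimal sketch of both routes, assuming scikit-learn and SciPy; the tiny document list is illustrative:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["gene expression microarray", "stock market prices fell",
        "protein and gene regulation", "interest rates and stock returns"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Route 1: cluster the TF-IDF vectors directly, e.g., with k-means.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Route 2: build a doc-doc similarity matrix and cluster hierarchically.
dist = 1.0 - cosine_similarity(X)                        # similarity -> distance
tree = linkage(squareform(dist, checks=False), method="average")
hc_labels = fcluster(tree, t=2, criterion="maxclust")

print(km_labels, hc_labels)
```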
© Hayes/Smyth: Introduction to Biomedical Informatics: 38
Case Study: Clustering of 2 Million PubMed Articles
Reference: Boyack et al, Clustering more than Two Million Biomedical
Publications…, PLoS One, 6(3), March 2011
• Data Set : 2.15 million articles in PubMed
– all articles published between 2004 and 2008, with at least 5 MeSH terms
• Data for each document
– MeSH terms
– Words from titles and abstracts
• Preprocessing
– MEDLINE stopword list of 132 words + 300 words commonly used at NIH
– Terms appearing in fewer than 4 documents were removed
– 272,926 unique terms and 175 million document-term pairs
© Hayes/Smyth: Introduction to Biomedical Informatics: 39
Methodology
• Methods compared
  – Data sources: MeSH terms only versus title/abstract words only
  – Similarity metrics:
    • tf-idf cosine (see earlier lectures)
    • Latent semantic indexing/analysis (see earlier lectures)
    • Topic modeling (discussed later in these slides)
    • Self-organizing map (neural network method)
    • Poisson-based model
  – 25 million similarity pairs computed
    • Approximately the top-12 most similar documents for each document
• 9 sets of clusters compared
• 9 combinations of data source + similarity metric evaluated
• Hierarchical (single-link) clustering applied to each of the 9 similarity sets
• Heuristics used to determine when clusters can no longer be merged
© Hayes/Smyth: Introduction to Biomedical Informatics: 40
Evaluation Methods and Results
• Evaluation metrics
– Textual coherence within a cluster (see paper)
– Coherence of papers within a cluster in terms of funding source
(Question: how reliable are these metrics?)
• Conclusions (from the paper)
– PubMed’s own related article approach (PMRA) generated the most
coherent and most concentrated cluster solution of the nine text-based
similarity approaches tested, followed closely by the BM25 approach
using titles and abstracts.
– Approaches using only MeSH subject headings were not competitive with
those based on titles and abstracts.
© Hayes/Smyth: Introduction to Biomedical Informatics: 41
Textual Coherence Results
© Hayes/Smyth: Introduction to Biomedical Informatics: 42
Two-dimensional map of
the highest-scoring cluster
solution, representing
nearly 29,000 clusters and
over two million articles.
© Hayes/Smyth: Introduction to Biomedical Informatics: 44
Background Resources: Document Clustering
• Papers
  – Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, et al. (2011)
    Clustering More than Two Million Biomedical Publications: Comparing the
    Accuracies of Nine Text-Based Similarity Approaches.
    PLoS ONE 6(3): e18029. doi:10.1371/journal.pone.0018029
  – Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey,
    Scatter/Gather: a cluster-based approach to browsing large document collections,
    Proceedings of ACM SIGIR '92.
  – Ying Zhao and George Karypis, Hierarchical clustering algorithms for
    document data sets, Data Mining and Knowledge Discovery, Vol. 10, No. 2,
    pp. 141-168, 2005.
• MALLET (Software)
  – Java-based package for classification, clustering, topic modeling, and more
  – http://mallet.cs.umass.edu/
© Hayes/Smyth: Introduction to Biomedical Informatics: 45
Topic Modeling
Some slides courtesy of David Newman, UC Irvine
© Hayes/Smyth: Introduction to Biomedical Informatics: 46
Topics = Multinomial Distributions
© Hayes/Smyth: Introduction to Biomedical Informatics: 47
What is Topic Modeling?
• Topic = probability distribution (multinomial) over words
• Document is assumed to be a mixture of topics
– Each document is represented by a probability distribution over topics
– Note that this is different to clustering, which assigns each doc to 1 cluster
• Learning algorithm is completely unsupervised
– No labels required
• Output of learning algorithm
– T topics, each represented by a probability distribution over words
– For each document, what its mixture of topics is
– For each word in each document, what topic it is assigned to
© Hayes/Smyth: Introduction to Biomedical Informatics: 48
What is Topic Modeling useful for?
• Summarize large document collections
• Retrieve documents
• Automatically index documents
• Analyze text
• Measure trends
• Generate topic-based networks
© Hayes/Smyth: Introduction to Biomedical Informatics: 49
Topic Modeling vs. Other Approaches
• Clustering (summarization)
– Topic model typically outperforms clustering
• Clustering: document belongs to single cluster
• Topic modeling: document is composed of multiple topics
• Latent semantic indexing (theme discovery)
– Topics tend to be more interpretable than LSI “vectors”
• TF-IDF (information retrieval)
– Keywords are often too specific. Topics (or a combination of topics and
keywords) are usually better
© Hayes/Smyth: Introduction to Biomedical Informatics: 50
Topic Modeling vs. Clustering
The Topic Model has the advantage over clustering that it can assign multiple topics per document.
Hidden Markov Models in
Molecular Biology: New
Algorithms and Applications
Pierre Baldi, Yves Chauvin, Tim Hunkapiller, Marcella A.
McClure
Hidden Markov Models (HMMs) can be applied to
several important problems in molecular biology.
We introduce a new convergent learning
algorithm for HMMs that, unlike the classical
Baum-Welch algorithm is smooth and can be
applied on-line or in batch mode, with or without
the usual Viterbi most likely path approximation.
Left-right HMMs with insertion and deletion states
are then trained to represent several protein
families including immunoglobulins and kinases.
In all cases, the models derived capture all the
important statistical properties of the families and
can be used efficiently in a number of important
tasks such as multiple alignment, motif detection,
and classification.
One Cluster:
[cluster 88] model, data, models, time, neural, figure, state, learning, set,
parameters, network, probability, number, networks, training, function,
system, algorithm, hidden, markov

Multiple Topics:
[topic 10] state, hmm, markov, sequence, models, hidden, states, probabilities,
sequences, parameters, transition, probability, training, hmms, hybrid, model,
likelihood, modeling
[topic 37] genetic, structure, chain, protein, population, region, algorithms,
human, mouse, selection, fitness, proteins, search, evolution, generation,
function, sequence, sequences, genes
© Hayes/Smyth: Introduction to Biomedical Informatics: 51
Topic Modeling
History
1990 – Latent Semantic Analysis (Deerwester et al.)
1999 – Probabilistic Latent Semantic Analysis (Hofmann)
2003 – Latent Dirichlet Allocation (Blei, Ng, Jordan)
2004 – Gibbs sampling for LDA (Griffiths & Steyvers)
2004+ – A variety of extensions and applications...
© Hayes/Smyth: Introduction to Biomedical Informatics: 52
How are the Topics Learned?
• Most widely used algorithm is based on Gibbs Sampling
– This is essentially a stochastic search technique from statistics
• Sketch of algorithm, for learning T topics
– Initialize each word in each document to a number from 1 to T
– Cycle through each word and resample its assignment, based on
• Which other topics are assigned to words in this document
• Which other words are assigned to this topic
– Continue to cycle through the data until convergence or for a fixed number
of iterations
– Typically takes 20 to 100 iterations to converge
– Each iteration is linear in the number of word tokens in the corpus (good!)
– Algorithm can be easily parallelized
(Parallelization algorithm invented at UCI, now used by Google, Yahoo, etc)
© Hayes/Smyth: Introduction to Biomedical Informatics: 53
optional
Equation for Sampling the Topic for each Word
The topic modeling algorithm uses a probabilistic model to stochastically and
iteratively assign topics to every word in the corpus.
p( zi = t | z−i ) ∝ ( ntd^(−i) + α ) / ( Σt′ nt′d^(−i) + Tα ) × ( nwt^(−i) + β ) / ( Σw′ nw′t^(−i) + Wβ )

where ntd = count of topic t assigned to doc d, nwt = count of word w assigned to topic t
(both counts computed excluding the current word i), T = number of topics, W = vocabulary
size, and α, β are smoothing parameters. This is the probability that word i is assigned
to topic t.
(A minimal sampler implementing this update follows below.)
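A minimal collapsed Gibbs sampler implementing this update (plain Python/NumPy, written for clarity rather than speed; documents are assumed to be lists of integer word ids in 0..W-1, and all names are illustrative):

```python
import numpy as np

def lda_gibbs(docs, W, T, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for a T-topic model over a W-word vocabulary."""
    rng = np.random.default_rng(seed)
    ntd = np.zeros((T, len(docs)))          # count of topic t assigned to doc d
    nwt = np.zeros((W, T))                  # count of word w assigned to topic t
    nt = np.zeros(T)                        # total tokens assigned to each topic
    z = []                                  # topic assignment for every token
    for d, doc in enumerate(docs):          # random initialization
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ntd[t, d] += 1; nwt[w, t] += 1; nt[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove the token's current assignment
                ntd[t, d] -= 1; nwt[w, t] -= 1; nt[t] -= 1
                # sampling weights; the doc-side denominator is constant in t
                p = (ntd[:, d] + alpha) * (nwt[w, :] + beta) / (nt + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t                 # resample and restore the counts
                ntd[t, d] += 1; nwt[w, t] += 1; nt[t] += 1
    return ntd, nwt                         # doc-topic and word-topic counts
```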
© Hayes/Smyth: Introduction to Biomedical Informatics: 54
Word/Document counts
for 16 Artificial Documents
[Figure: word-document count matrix for 16 artificial documents over the terms
river, stream, bank, money, and loan]
Can we recover the original topics and topic mixtures from this data?
P. Smyth: Fraunhofer IAIS, May 2010: 55
Example of Gibbs Sampling
• Assign word tokens randomly to 2 topics:
[Figure: the same word-document counts with every word token randomly assigned
to one of the 2 topics]
P. Smyth: Fraunhofer IAIS, May 2010: 56
After 1 iteration
[Figure: topic assignments for each word token after 1 Gibbs sampling iteration]
P. Smyth: Fraunhofer IAIS, May 2010: 57
After 4 iterations
[Figure: topic assignments for each word token after 4 Gibbs sampling iterations]
P. Smyth: Fraunhofer IAIS, May 2010: 58
After 32 iterations
[Figure: topic assignments after 32 iterations. The two recovered topics are:
topic 1 – stream .40, bank .35, river .25
topic 2 – bank .39, money .32, loan .29]
P. Smyth: Fraunhofer IAIS, May 2010: 59
Explaining Topic Modeling with New York Times Documents
Topic Modeling begins with the text from a corpus of documents…
About a month after the attacks of Sept. 11, firefighters widows began showing up at an obscure
investment firm on Long...¶ To hear Raymond W. Baker tell it, companies in the United States and
throughout the West have it exactly backward when i...¶ You dont have to be honest to use the Internet.
E-mail scams are a growing problem. Youve almost certainly heard from ...¶ In a city where banks sprout
like mushrooms, and die just as fast, the money-moving network described in the unfolding B...¶ The
federal governments chief witness against Robert E. Brennan, the former penny-stock promoter,
concluded his testimo...¶ From the well-appointed offices of Lehman Brothers in Manhattan, federal
prosecutors say, Consuelo Marquez, a Barnard Co...¶ Appearing calm and rested, and smiling occasionally
at one another and their lawyers, Edwards and Berlin described the o. Cary Stayner was sentenced
Thursday to die for committing three murders near Yosemite National Park in 1999. For FBI age ...¶ As
murders go, it was the sort guaranteed to fetch modest attention. Two men were held up at gunpoint in a
restaurant pa ...¶ Teachers loaded some of the elementary school children onto buses. Another group
was hustled down the hill to the old wo ...¶ It started as a whisper from nowhere. Then the whispers
became voices. James Michael Brown had no choice but to listen. ...¶ A day is worth more than 24 hours
to Coral Eugene Watts. Its worth 72, a three-for-one bargain, a triple-time countdown ....¶ The middleaged man and the teenager were footloose traveling companions on a fathomless mission of horror. For
three we ...¶ A jury convicted Michael C. Skakel on Friday of murder in the 1975 bludgeoning death of
Martha Moxley, using more than a ...¶ On July 13, 1990, an expert testified in a Manhattan court that
DNA analysis of semen found in the victim in the Central ...¶ As the man suspected of killing as many as
18 women sat in a downtown prison cell, investigators fielded calls from othe ...¶ For Johnson, an exMarine, stucco specialist and associate pastor of a small Kansas City church, the next couple of hour ...¶
© Hayes/Smyth: Introduction to Biomedical Informatics: 60
Initially, topics are randomly assigned to the
words in each document
For this example, we’ve set the number of topics to 10.
19757 bludgeoning2 convicted4 death10 Friday6 jury2 Martha9 Michael3
more3 Moxley9 murder6 Skakel5 using2 ¶ against4 Brennan3 chief1
concluded3 federal8 former2 governments4 penny-stock6 promoter3
Robert9 testimony4 witness1...
© Hayes/Smyth: Introduction to Biomedical Informatics: 62
Words get assigned probabilistically to topics
19757 bludgeoning2 convicted4 death10 Friday6 jury2 Martha9 Michael3
more3 Moxley9 murder6 Skakel5 using2 ¶ against4 Brennan3 chief1
concluded3 federal8 former2 governments4 penny-stock6 promoter3
Robert9 testimony4 witness1...
turns into
19753 bludgeoning3 convicted3 death3 Friday5 jury3 Martha1 Michael9
more5 Moxley3 murder3 Skakel3 using6 ¶ against6 Brennan8 chief8
concluded5 federal8 former8 governments9 penny-stock8 promoter8
Robert1 testimony8 witness3...
© Hayes/Smyth: Introduction to Biomedical Informatics: 63
Topics are expressed as lists of words likely to co-occur in individual documents
#3: murder police victim killing shot killed death crime violence shooting found
suspect killer kill woman told case night scene investigator
#8: money million business dollar account cash paid pay scandal financial trust
fraud corruption bank scheme laundering illegal deal profit transfer
19753 bludgeoning3 convicted3 death3 Friday5 jury3 Martha1
Michael9 more5 Moxley3 murder3 Skakel3 using6 ¶ against6
Brennan8 chief8 concluded5 federal8 former8 governments9
penny-stock8 promoter8 Robert1 testimony8 witness3...
© Hayes/Smyth: Introduction to Biomedical Informatics: 64
Humans Interpret Topics
After the model produces the topics, a human domain expert interprets the list of most
likely words and chooses a topic name.
topic #3:
murder police victim killing shot killed death
crime violence shooting found suspect killer kill
woman …
topic #3 =
“Crime”
topic #8 =
“Financial
Corruption”
topic #8:
money million business dollar account cash paid pay
scandal financial trust fraud corruption bank scheme
laundering illegal …
© Hayes/Smyth: Introduction to Biomedical Informatics: 65
Topic Modeling can be applied to various types of text …
Collection            # docs      Description
New York Times        300,000     News articles from the New York Times and other newspapers, 2000-2002
Austen                1,400       The six Jane Austen novels, broken up into 100-line sections
Blogs                 4,000       Blog entries harvested from Daily Kos
Bible                 1,200       Chapters in the Bible (KJV)
Police Reports        250,000     Police accident reports from North Carolina
CiteSeer              750,000     Abstracts from research publications in computer science and engineering
Search Queries        1,000,000   Queries issued to a web search engine
Enron                 250,000     Enron emails seized by the US Government for the federal case against the company
Library of Congress   240,000     Metadata records from the Library of Congress American Memory Collection
© Hayes/Smyth: Introduction to Biomedical Informatics: 66
… sample topics
Collection            Sample Topic
New York Times        [WMD] IRAQ iraqi weapon war SADDAM_HUSSEIN SADDAM resolution UNITED_STATES military inspector U_N UNITED_NATION BAGHDAD inspection action SECURITY_COUNCIL
Austen                [SENTIMENT] felt comfort feeling feel spirit mind heart ill evil fear impossible hope poor distress end loss relief suffering concern dreadful misery unhappy
Blogs                 [ELECTIONS] november poll house electoral governor polls account ground republicans trouble
Bible                 [COMMANDS] thou thy thee shalt thine lord god hast unto not shall
Police Reports        [RAN OFF ROAD] v1 off road ran came rest ditch traveling struck side shoulder tree overturned control lost
CiteSeer              [GRAPHS] graph edge vertices edges vertex number directed connected degree coloring subgraph set drawing
Search Queries        [CREDIT] credit card loans bill loan report bad visa debt score
Enron                 [ENERGY CRISIS] state california power electricity utilities davis energy prices generators edison public deregulation billion governor federal consumers commission plants companies electric wholesale crisis summer
Library of Congress   [ARMY] military camp army war officer personnel british regiment tent crimean crimea soldier guard general infantry
© Hayes/Smyth: Introduction to Biomedical Informatics: 67
Case Study: Analysis of DNA Microarray Literature
• Government-sponsored study of > 50,000 papers from PubMed that
are related to DNA microarrays
– Papers date from the early 1990s to 2012
– 48,872 documents
– 59,408 unique words
– 6.1 million word tokens
– 191,569 unique authors
– 4,455 different journals
• Ran a topic model with T=100 topics to gain insight into what the
different sub-areas are within the field of DNA microarrays and how
these have changed over time
© Hayes/Smyth: Introduction to Biomedical Informatics: 68
Examples of Topics Learned
Displayed below are the top 5 highest probability words for 5 selected topics
Microarray Chip Technology:   detection, surface, fluorescence, hybridization, array
Classification Methods:       classification, selection, cancer, algorithm, feature
Databases and Annotation:     databases, tool, annotation, data set, web
Regulatory Networks:          network, regulatory, pathway, interaction, transcriptional
Patients and Survival:        patient, tumor, cancer, survival, prognostic
From basic technology to applications
© Hayes/Smyth: Introduction to Biomedical Informatics: 69
How Topics Change over Time
[Figure: two plots of the fraction of words assigned per topic, 1990-2010, each with a
base-rate line. Left plot (Basic Technology, declining): microarray technology,
hybridization, probe technology, sequencing. Right plot (Applications, rising):
patients and treatment, classification, pathways and networks, stem cells.]
Selected topics with largest changes in words assigned to the topic:
negative (left plot) and positive (right plot)
© Hayes/Smyth: Introduction to Biomedical Informatics: 70
New York Times 2000-02: Sample Topics
Basketball: team (0.028), play (0.015), game (0.013), season (0.012), final (0.011),
games (0.011), point (0.011), series (0.011), player (0.010), coach (0.009),
playoff (0.009), championship (0.007), playing (0.006), win (0.006)
Named entities: LAKERS (0.062), SHAQUILLE-O-NEAL (0.028), KOBE-BRYANT (0.028),
PHIL-JACKSON (0.019), NBA (0.013), SACRAMENTO (0.007), RICK-FOX (0.007),
PORTLAND (0.006), ROBERT-HORRY (0.006), DEREK-FISHER (0.006)

Tour de France: tour (0.039), rider (0.029), riding (0.017), bike (0.016), team (0.016),
stage (0.014), race (0.013), won (0.012), bicycle (0.010), road (0.009), hour (0.009),
scooter (0.008), mountain (0.008), place (0.008)
Named entities: LANCE-ARMSTRONG (0.021), FRANCE (0.011), JAN-ULLRICH (0.003),
LANCE (0.003), U-S-POSTAL-SERVICE (0.002), MARCO-PANTANI (0.002), PARIS (0.002),
ALPS (0.002), PYRENEES (0.001), SPAIN (0.001)

Holidays: holiday (0.071), gift (0.050), toy (0.023), season (0.019), doll (0.014),
tree (0.011), present (0.008), giving (0.008), special (0.007), shopping (0.007),
family (0.007), celebration (0.007), card (0.007), tradition (0.006)
Named entities: CHRISTMAS (0.058), THANKSGIVING (0.018), SANTA-CLAUS (0.009),
BARBIE (0.004), HANUKKAH (0.003), MATTEL (0.003), GRINCH (0.003), HALLMARK (0.002),
EASTER (0.002), HASBRO (0.002)
© Hayes/Smyth: Introduction to Biomedical Informatics: 71
New York Times 2000-02: Sample Topics
September 11 Attacks: attack (0.033), tower (0.025), firefighter (0.020), building (0.018),
worker (0.013), terrorist (0.012), victim (0.012), rescue (0.012), floor (0.011),
site (0.009), disaster (0.008), twin (0.008), ground (0.008), center (0.008),
fire (0.007), plane (0.007)
Named entities: WORLD-TRADE-CTR (0.035), NEW-YORK-CITY (0.020), LOWER-MANHATTAN (0.005),
PENTAGON (0.005), PORT-AUTHORITY (0.003), RED-CROSS (0.002), NEW-JERSEY (0.002),
RUDOLPH-GIULIANI (0.002), PENNSYLVANIA (0.002), CANTOR-FITZGERALD (0.001)

FBI Investigation: agent (0.029), investigator (0.028), official (0.027), authorities (0.021),
enforcement (0.018), investigation (0.017), suspect (0.015), found (0.014), police (0.014),
arrested (0.012), search (0.012), law (0.011), arrest (0.011), case (0.010),
evidence (0.009), suspected (0.008)
Named entities: FBI (0.034), MOHAMED-ATTA (0.003), FEDERAL-BUREAU (0.001),
HANI-HANJOUR (0.001), ASSOCIATED-PRESS (0.001), SAN-DIEGO (0.001), U-S (0.001),
FLORIDA (0.001)

DC Sniper: sniper (0.024), shooting (0.019), area (0.010), shot (0.009), police (0.007),
killer (0.006), scene (0.006), white (0.006), victim (0.006), attack (0.005), case (0.005),
left (0.005), public (0.005), suspect (0.005), killed (0.005), car (0.005)
Named entities: WASHINGTON (0.053), VIRGINIA (0.019), MARYLAND (0.013), D-C (0.012),
JOHN-MUHAMMAD (0.008), BALTIMORE (0.006), RICHMOND (0.006), MONTGOMERY-CO (0.005),
MALVO (0.005), ALEXANDRIA (0.003)
© Hayes/Smyth: Introduction to Biomedical Informatics: 72
Topic Trends (New York Times data set)
[Figure: eight topic-trend plots, one per topic, showing thousands of words assigned
to the topic over time (Jan 2000 - Jan 2003). Left column: Basketball, Tour-de-France,
Oscars, Quarterly-Earnings. Right column: Sept-11-Attacks, Anthrax, DC-Sniper, Enron.]
© Hayes/Smyth: Introduction to Biomedical Informatics: 73
From Dr. David Newman
Computer Science Department
UC Irvine
https://app.nihmaps.org/nih/browser/
© Hayes/Smyth: Introduction to Biomedical Informatics: 74
Background Resources: Topic Modeling
• D. Blei, Introduction to Probabilistic Topic Models, Communications of the ACM,
  preprint, 2011.
  – http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
• D. Blei and J. Lafferty, Topic Models, in A. Srivastava and M. Sahami, editors,
  Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC
  Data Mining and Knowledge Discovery Series, 2009.
  – http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf
• M. Steyvers and T. Griffiths (2007), Probabilistic Topic Models. In T. Landauer,
  D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road
  to Meaning, Lawrence Erlbaum.
• General resources (papers, code, newsgroup, etc.)
  – http://www.cs.princeton.edu/~blei/topicmodeling.html
© Hayes/Smyth: Introduction to Biomedical Informatics: 75
Additional Backup Slides on Application of Topic
Modeling to Domain-Specific Browsing
© Hayes/Smyth: Introduction to Biomedical Informatics: 76
Application: Domain-Specific Browsers
• Collect a large set of documents in a particular domain
– Build a domain-specific topic model for the documents
– Build a browser based on the topic model
• Example: Schizophrenia Research
– 40,000 abstracts from PubMed mention schizophrenia
– We built a topic model with 200 topics
– Also automatically extracted
• Gene names/symbols
• Brain regions
– Developed a browser/search-tool that uses the topics to
allow a user to “navigate” the world of schizophrenia research
– Combines statistical model with real-time search on
documents
– Master’s thesis work by Vasanth Kumar, ICS MS 2006
P. Smyth: Fraunhofer IAIS, May 2010: 77
P. Smyth: Fraunhofer IAIS, May 2010: 78
Topics are discovered
automatically, and span the “space” of
schizophrenia research.
Only topic naming is done manually
P. Smyth: Fraunhofer IAIS, May 2010: 79
P. Smyth: Fraunhofer IAIS, May 2010: 80
P. Smyth: Fraunhofer IAIS, May 2010: 81
P. Smyth: Fraunhofer IAIS, May 2010: 82
P. Smyth: Fraunhofer IAIS, May 2010: 83
P. Smyth: Fraunhofer IAIS, May 2010: 84
Information Extraction
© Hayes/Smyth: Introduction to Biomedical Informatics: 85
Information Extraction
• Broad goal:
– Convert unstructured text into structured form (e.g., database)
– Many applications, e.g.,
• Mining of free-form narratives in clinical diagnoses
• Extraction of scientific information from biomedical papers
• Main Categories:
– Named entity recognition (NER)
• E.g., recognize disease names, genes, proteins, drugs, in text
• Can be challenging: e.g., genes are referred to in multiple ways
– Relation extraction
• Given named entities, detect how they are related, e.g.,
– Gene A regulates Gene B
– Drug X can cause adverse symptoms Y and Z
• Harder than named entity recognition
– Much variety and subtlety in how we describe relational information
© Hayes/Smyth: Introduction to Biomedical Informatics: 87
Example of Information Extraction
From Sarawagi,
Foundations and Trends in Databases, 2008
© Hayes/Smyth: Introduction to Biomedical Informatics: 88
Information Extraction Techniques
• Dictionaries
– E.g., lists of names of genes, proteins, diseases, drugs
• Natural Language Processing
– High-quality parsers can parse sentences into parts of speech
– E.g., noun-phrases
– Subject-object-verb triples useful for relation extraction
• Rule-based Algorithms
  – E.g., if a word is capitalized and a noun, then it’s an entity
• Machine learning approaches
  – e.g., use human-labeled gene name mentions to train a classifier
  – Features can be parts of speech, nearby words, capitalization, etc.
(A minimal dictionary + rule-based sketch follows below.)
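A minimal sketch combining the dictionary and rule-based ideas above (plain Python; the tiny gene dictionary and the all-caps regex rule are illustrative only, and real systems use much richer resources):

```python
import re

GENE_DICT = {"brca1", "tp53", "egfr"}                 # dictionary of known gene symbols
GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")   # crude rule: all-caps symbol

def extract_gene_mentions(text):
    """Return candidate gene mentions found by dictionary lookup or the regex rule."""
    mentions = set()
    for token in re.findall(r"\w+", text):
        if token.lower() in GENE_DICT:                # dictionary match (case-insensitive)
            mentions.add(token)
    mentions.update(GENE_PATTERN.findall(text))       # rule-based match
    return mentions

print(extract_gene_mentions("BRCA1 and Tp53 interact with the EGFR pathway."))
# {'BRCA1', 'Tp53', 'EGFR'}
```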
© Hayes/Smyth: Introduction to Biomedical Informatics: 89
Example of Part of Speech (POS) Tagging
(from Chris Manning, Stanford)
When/RB  Mr./NNP  Holly/NNP  last/RB  wrote/VBD  ,/,  many/JJ  years/NNS
(A small tagging sketch using NLTK follows below.)
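A minimal tagging sketch using NLTK (listed under NLP packages at the end of these slides); the downloadable resource names may differ slightly across NLTK versions, and the sample sentence is taken from the Reuters example earlier:

```python
import nltk

# Download the tokenizer and tagger models (names can vary by NLTK version).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The Bundesbank left credit policies unchanged.")
print(nltk.pos_tag(tokens))   # list of (token, Penn Treebank tag) pairs
```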
© Hayes/Smyth: Introduction to Biomedical Informatics: 90
Example of Phrase Parsing
(from Chris Manning, Stanford)
[Figure: phrase-structure parse tree for the sentence "Bills on ports and immigration
were submitted by Senator Brownback", with the noun phrase "Bills on ports and
immigration" and the verb phrase "were submitted by Senator Brownback"]
© Hayes/Smyth: Introduction to Biomedical Informatics: 91
From Krallinger, Valencia, Hirschman,
Genome Biology, 2008
© Hayes/Smyth: Introduction to Biomedical Informatics: 92
From Savova et al,
J. Am. Med. Inform. Assoc, 2010
© Hayes/Smyth: Introduction to Biomedical Informatics: 93
From Erhadt, Schneider, Blaschke,
Drug Discovery Today, 2006
© Hayes/Smyth: Introduction to Biomedical Informatics: 94
From Jensen, Saric, Bork,
Nature Reviews Genetics, 2006
© Hayes/Smyth: Introduction to Biomedical Informatics: 95
From Feldman, Regev, Hurvitz, and Finkelstein-Landau,
Biosilico, 2003
© Hayes/Smyth: Introduction to Biomedical Informatics: 96
From Feldman, Regev, Hurvitz, and Finkelstein-Landau,
Biosilico, 2003
© Hayes/Smyth: Introduction to Biomedical Informatics: 97
“Buzzword Hunting”
From Jensen, Saric, Bork,
Nature Reviews Genetics, 2006
© Hayes/Smyth: Introduction to Biomedical Informatics: 98
Commercial Software, e.g., IBM Medical Record Text Analytics
© Hayes/Smyth: Introduction to Biomedical Informatics: 99
Challenges in Information Extraction
• Fundamentally hard
– Ambiguity in natural language is difficult for computers to handle
– Current systems are reasonably accurate …. but room for improvement
• Validation
– Measuring recall for example can be difficult
• Calibration
– How do we calibrate a method so that we know how reliable it is?
• Integration
– How should text-based indicators be combined with actual experimental
data?
© Hayes/Smyth: Introduction to Biomedical Informatics: 100
Resources in Information Extraction
• Review paper on basic technology:
– Information Extraction, by Sunita Sarawagi, in Foundations and Trends in
Databases, 2008
• Reviews with a biomedical focus:
– Erhardt, Schneider, and Blaschke, Status of text-mining techniques
applied to biomedical text. Drug Discovery Today, 11(7/8), pp.315-325,
2006
– Jensen, Saric, Bork, Literature mining for the biologist: from information
to biological discovery, Nature Reviews: Genetics, 7, Feb 2006.
• Additional papers on the class Web site
© Hayes/Smyth: Introduction to Biomedical Informatics: 101
Software for Natural Language Processing
(lists from Chris Manning, Stanford)
• NLP Packages
  – NLTK (Python)
    • http://www.nltk.org/
  – OpenNLP
    • http://incubator.apache.org/opennlp/
  – Stanford NLP
    • http://nlp.stanford.edu/software/
  – LingPipe
    • http://alias-i.com/lingpipe/
• NLP Frameworks
  – GATE – General Architecture for Text Engineering (U. Sheffield)
    • http://gate.ac.uk/
    • Java, quite well maintained (now)
  – UIMA – Unstructured Information Management Architecture. Originally IBM; now an
    Apache project
    • http://uima.apache.org/
    • Professional, scalable, etc.
  – NLTK – Natural Language Toolkit (started by Steven Bird)
    • http://www.nltk.org/
    • Big community; large Python package; corpora and books about it
© Hayes/Smyth: Introduction to Biomedical Informatics: 102