Conference Paper - Department of Electrical Engineering and Computer Science, University of Central Florida

Big Data Analysis
Mario Massad, Matthew Toschi, Tyler Truong
Abstract – Every day, more articles of unstructured data
populate the internet, where they are left untouched. Much
of this data contains raw text, unlabeled and unclassified.
Analyzing these enormous amounts of data can lead to the
discovery of trends within it. This information, when fully
realized, can be utilized to find unknown relationships
between the entities it contains, such as people, businesses,
or other groups. Taking these unstructured documents, we
use natural language processing and named entity
recognition to identify entities such as people, organizations,
and locations, and to recognize the connections which link
them together. Implementing latent Dirichlet allocation
(LDA), the similarity between documents is discovered.
Together, the relevancy of documents and the similarities of
the entities they share can shed light on connections
previously undiscovered.
Keywords: Natural Language Processing, Clustering
methods, Sentiment Analysis, Named Entity Recognition
I. INTRODUCTION
Connecting unstructured data in a way that is
meaningful is an ongoing problem in today’s world.
When looking at unstructured data such as news articles,
computers see this data simply as strings. Machines do
not interpret words like a human does. Humans recognize
words and see objects such as people and organizations.
Humans can make the connection between one sentence
and the next. If a computer were to have the ability to see
things in the same way, it could read hundreds of
documents at a time and make connections between
entities faster than any human could. Taking this
unstructured data, the first step is to decompose sentences
into their significant components using several natural
language processing techniques such as parts-of-speech
tagging, which is the process of marking up the
individual components such as nouns, verbs, and
adjectives. To the system, the most relevant elements of
a sentence are nouns and verbs. Nouns give information
about people, locations, and organizations that would be
considered relevant to the article, and verbs provide the
relations between nouns. In addition to parts-of-speech
tagging, we implemented dependency tree parsing, which
allows us to observe the relations of each word in a
sentence to another word. These natural language
processing techniques are implemented using Stanford’s
CoreNLP library. The information that is extracted is
stored in a MongoDB database, chosen to fit our
unstructured data. Extracting these details from the
database, the system implements latent Dirichlet
allocation (LDA). Latent Dirichlet Allocation allows sets
of observations to be explained by unobserved groups
[1]. LDA represents each article of unstructured data as a
mixture of topics. It is an unsupervised, statistical
approach for modeling text corpora by discovering latent
semantic topics in large collections of text documents [2].
In addition to text, LDA can be applied to images and
music; however, our research focuses on the natural
language of our articles. The idea behind LDA is that
documents containing similar content will most likely
contain similar words. Sets of words that often co-occur
are referred to as topics. Documents that contain a similar topic structure can
be seen as relevant. Document A could contain “aircraft”,
“Siemens” as topics, and Document B could contain
“aircraft”, “Boeing”. These two documents would have
some relevance to each other. The word “aircraft” is what
relates the two. With this information, one could use it to
discover connections between entities that might have
been previously undiscovered.
II. NATURAL LANGUAGE PROCESSING
Natural language processing (NLP) is a field of
computer science involving the interaction between
computers and human language. NLP algorithms can
be used to extract meaningful information out of
documents. We found Stanford's CoreNLP, a natural
language processing library, to be the most viable tool for
this project. In our research, we considered four different
natural language processing techniques, all provided by
Stanford CoreNLP (a pipeline sketch follows the list below):
• Part-of-speech tagging – process of marking up every word in a document to its particular part of speech (noun, verb, adjective, etc.). Each sub-phrase in a document gets marked.
• Dependency parsing – unlike POS tagging, dependency parsing breaks down a document based on word relations. Words that are related to each other are connected based on direct links.
• Named-entity recognition – subtask of information extraction involving locating and classifying elements in text into pre-defined categories such as persons, organizations, locations, etc.
• Sentiment analysis – branch of natural language processing and text analysis, used to identify and extract subjective information in text. Determines the attitude of the text, “positive” or “negative”.
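As a rough sketch of how these four annotators can be combined into a single pipeline, the snippet below uses the stanza Python client to a local CoreNLP server; our system calls the Java CoreNLP library directly, so the client, the example text, and the server settings here are illustrative assumptions only.

from stanza.server import CoreNLPClient

# Illustrative input; any news-article text could be substituted.
text = "Siemens announced a new partnership with Boeing in Berlin."

# The annotator list mirrors the four techniques above (pos, depparse, ner,
# sentiment), plus the prerequisites CoreNLP requires (lemma, parse).
with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos", "lemma", "ner",
                    "parse", "depparse", "sentiment"],
        timeout=30000, memory="4G") as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for token in sentence.token:
            # Each token carries its POS tag and NER label.
            print(token.word, token.pos, token.ner)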
III. PART-OF-SPEECH TAGGING
Part-of-speech tagging, or grammatical tagging,
is the process of marking up words in a text to their
corresponding part of speech. These tags are based on
each word’s definition as well as its context. The
sentence “He set the food on the table.” gives us the
following tagged sentence.
He/PRP set/VBD the/DT food/NN on/IN the/DT table/NN ./.
In this sentence, the word “set” is considered a verb;
however, in another example, “The set of numbers were
sorted” provides us with:
The/DT set/NN of/IN numbers/NNS were/VBD sorted/VBN ./.
Here, “set” is considered a noun. The eight main parts of speech in
English are noun, verb, adjective, preposition, pronoun,
adverb, conjunction, and interjection; however, many more
part-of-speech tags stem from these eight. When evaluating
a sentence in a text, we look primarily to extract nouns and
verbs. These help us focus on what is significant: nouns tell
us information such as people and organizations, while
verbs provide the connections between entities.
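A minimal sketch of this tagging step, using NLTK's Penn Treebank tagger as a lightweight stand-in for CoreNLP's POS annotator (the tag set is the same Penn Treebank set shown in the examples above):

import nltk

# One-time model downloads (assumed available thereafter).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

for sentence in ["He set the food on the table.",
                 "The set of numbers were sorted."]:
    tokens = nltk.word_tokenize(sentence)
    # pos_tag returns (word, tag) pairs, e.g. ("set", "VBD") vs. ("set", "NN").
    print(nltk.pos_tag(tokens))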
IV. DEPENDENCY PARSING
As mentioned earlier, dependency parsing
involves breaking down text based on word relations.
While POS tagging focuses on each word's significance
to a sentence, dependency parsing connects each word
using direct links or weights. Each word depends on one
parent, which is another word or a root symbol.
[Figure: example sentence “John hit the ball with the bat.” and its dependency parse [3]]
Each word in the sentence above connects to one other
word in the text based on some relation. Stanford
CoreNLP defines the above sentence as:
nsubj(hit-2, John-1)
root(ROOT-0, hit-2)
det(ball-4, the-3)
dobj(hit-2, ball-4)
prep(hit-2, with-5)
det(bat-7, the-6)
pobj(with-5, bat-7)
The relation between the words “hit” and “John” is defined by nsubj
(nominal subject). A nominal subject is a noun phrase
which is the syntactic subject of a clause. Each of these
relation definitions gives us insight into how nouns are
related to each other. Dependency trees are obtained by a
maximum spanning tree algorithm [3]. Edges are placed
between each pair of words in a sentence. Edges are
scored based on the dot product between a high
dimensional feature representation of the edge and
a weight vector [3]:
𝑠(𝑖, 𝑗) = 𝑤 ∙ 𝑓(𝑖, 𝑗)
The score of a dependency tree y for sentence x is,
𝑠(𝑥, 𝑦) = ∑(𝑖,𝑗)∈𝑦 𝑠(𝑖, 𝑗) = ∑(𝑖,𝑗)∈𝑦 𝑤 ∙ 𝑓(𝑖, 𝑗)
Provided by [3]
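A small numerical sketch of this scoring, with made-up feature vectors f(i, j) and a made-up weight vector w; the parser of [3] uses high-dimensional learned features, so every value below is purely illustrative.

import numpy as np

# Hypothetical 4-dimensional feature vectors for three edges of a candidate tree.
f = {
    ("hit", "John"): np.array([1.0, 0.0, 1.0, 0.0]),
    ("hit", "ball"): np.array([0.0, 1.0, 1.0, 0.0]),
    ("with", "bat"): np.array([0.0, 0.0, 1.0, 1.0]),
}
w = np.array([0.5, 0.8, 0.2, 0.3])  # weight vector (illustrative values)

# s(i, j) = w . f(i, j); the tree score s(x, y) sums the edge scores.
edge_scores = {edge: float(w @ feat) for edge, feat in f.items()}
tree_score = sum(edge_scores.values())
print(edge_scores, tree_score)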
V. NAMED-ENTITY RECOGNITION
Considered a subtask of information extraction,
named-entity recognition (NER) involves locating and
classifying elements in a text into pre-defined categories.
Using Stanford CoreNLP, we attempt to locate the
following categories: persons, organizations, locations,
money, percentages, times, and dates. These seven
categories assist in discovering the topics of each
document. NER systems commonly use linguistic
grammar based models as well as statistical models such
as machine learning. Popular models such as Stanford
CoreNLP and OpenNLP take a statistical approach to
named-entity recognition. Statistical models require
large amounts of hand-annotated data; however, some
use semi-supervised approaches to get around this.
Each time we classify a new sentence fragment,
it cycles back into our data where the classifier will be
retrained and used to achieve better results.
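A sketch of pulling entity mentions out of tagged text by grouping consecutive tokens that share a non-"O" NER label, again via the stanza Python client as a stand-in for our Java pipeline; the example sentence and label names are assumptions.

from stanza.server import CoreNLPClient

text = "Siemens hired John Smith in Orlando for 2 million dollars."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "lemma", "ner"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(text)
    entities = []
    for sentence in ann.sentence:
        current_words, current_label = [], "O"
        for token in sentence.token:
            if token.ner == current_label and token.ner != "O":
                current_words.append(token.word)      # continue the same mention
            else:
                if current_label != "O":
                    entities.append((" ".join(current_words), current_label))
                current_words, current_label = [token.word], token.ner
        if current_label != "O":
            entities.append((" ".join(current_words), current_label))
    # e.g. [("Siemens", "ORGANIZATION"), ("John Smith", "PERSON"), ...]
    print(entities)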
VI. SENTIMENT ANALYSIS
Sentiment Analysis attempts to extract
subjective information from text. This stage of natural
language processing does not provide any new
information about the text such as entities and relations;
however, it gives insight on the “vibe” of the text, such
as positive and negative. For instance, when examining
Twitter statuses, one can usually classify tweets as
“positive” or “negative”. “Company A claimed
bankruptcy”, would fall under “negative”, while the
sentence “Company A sold for 2 million dollars” would
be classified as “positive” sentiment. Sentiment analysis
conducted over a period of time can reveal meaningful
information about a company or other entity; for example,
whether a company is doing well or seems to be going
under. However, given the way news articles are typically
constructed, sentiment analysis proves to be slightly more
difficult. When performing sentiment analysis on tweets, it
is often clear whether a tweet is “positive” or “negative”: a
tweet can be as positive as “My Team is Winning!!!” or as
negative as “THIS album SUCKS!”. News articles, however,
are often written in an unbiased way, and conducting
sentiment analysis on them can prove to be a challenge. To
improve our results, we turned to supervised training of our
classifier: we took sentence fragments from our texts and
hand-classified each fragment on a scale from “very
negative” to “very positive”.
[Figure: example of classifying sentence fragments]
[Figure: sentiment analysis pipeline]
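Our sentiment stage retrains CoreNLP's classifier on hand-labeled fragments; as a simplified stand-in for that supervised step, a bag-of-words classifier over a few hand-classified fragments could look like the sketch below (the fragments, labels, and use of scikit-learn are all illustrative assumptions, not our actual model).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-classified training fragments (toy examples on a coarse scale).
fragments = [
    "Company A claimed bankruptcy",
    "Company A sold for 2 million dollars",
    "THIS album SUCKS!",
    "My Team is Winning!!!",
]
labels = ["negative", "positive", "negative", "positive"]

# Bag-of-words features fed into a simple logistic-regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(fragments, labels)

# Newly classified fragments would cycle back into the training data over time.
print(model.predict(["The company is doing well"]))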
VII. STORAGE AND PROCESSING
As the amount of data in the world has grown
tremendously, so has the debate of SQL vs. NoSQL
databases. SQL is the standard language for relational
database management systems, and NoSQL attempts to
use a key-value structure with no predetermined rows or
columns that are seen in SQL databases. One of the most
popular NoSQL databases is MongoDB. MongoDB
carries no strict rules on data relations, which makes it the
best fit for the unstructured format of our data. Articles that
are imported into the system are stored using MongoDB
GridFS, from which they are exported to be processed and
broken down using Stanford CoreNLP's natural language
processing techniques. The processed information is stored
as shown below.
[Figure: MongoDB document]
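A minimal sketch of this storage step with the pymongo driver; the database, collection, field names, and example values are placeholders, and our actual document schema is the one shown in the figure.

import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["articles"]                      # placeholder database name
fs = gridfs.GridFS(db)

# Raw article text goes into GridFS...
raw_text = "Siemens announced a new partnership with Boeing in Berlin."
file_id = fs.put(raw_text.encode("utf-8"), filename="article_001.txt")

# ...while the CoreNLP output is stored as an ordinary document.
db.processed.insert_one({
    "article_id": file_id,
    "entities": [{"text": "Siemens", "type": "ORGANIZATION"},
                 {"text": "Boeing", "type": "ORGANIZATION"},
                 {"text": "Berlin", "type": "LOCATION"}],
    "sentiment": "neutral",
})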
VIII. LATENT DIRICHLET ALLOCATION
As we break down and process each unstructured
news article, we require a technique that will allow us to
pair relevant articles together. Clustering algorithms involve
grouping similar elements together; here we turn to latent
Dirichlet allocation. LDA is an unsupervised, statistical
approach used for modeling text to find latent semantic
topics in text documents. It assumes that documents are
roughly similar if they contain similar words. Words are
treated as random variables, and we can compute the
probability distribution over the latent variables conditioned
on the observed variables.

Topic 1            Topic 2             Topic 3
term     weight    term      weight    term     weight
game     0.014     oil       0.021     food     0.021
team     0.011     Siemens   0.006     animal   0.015
soccer   0.009     electric  0.006     dog      0.013
play     0.008     power     0.005     healthy  0.0012

Example LDA table – 3 topics
In the table above, three topics appear. Each topic has a
group of words that represent it. The terms in topic 1 seem
to have something to do with soccer games, so articles that
contain these words might fall under topic 1. In addition,
one document might fall under more than one topic:
document X could contain 40% topic 1 and 60% topic 2.
These probabilities help us find which articles are relevant
to the user's query or to another article.
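A minimal LDA sketch with gensim, using a handful of toy pre-tokenized documents in place of our article corpus; the corpus, topic count, and library choice are illustrative assumptions mirroring the example table above.

from gensim import corpora, models

# Toy pre-tokenized documents standing in for processed news articles.
docs = [
    ["soccer", "game", "team", "play"],
    ["oil", "power", "electric", "siemens"],
    ["food", "dog", "animal", "healthy"],
    ["team", "play", "soccer", "win"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit a 3-topic model, mirroring the example table above.
lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                      passes=10, random_state=1)
print(lda.print_topics())

# Topic mixture of a single document, e.g. 40% topic 1 and 60% topic 2.
print(lda.get_document_topics(corpus[0]))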
IX. QUERY AND USER INTERFACE
With all of this information available, we look
to allow users to access these articles. A user's query may
involve a simple collection of words that the user wants
to know more about. An example would be “aircraft” or
“power plants”. Taking the user's query, we break it
down into its individual components and use these to run
against our LDA table. The articles that match up with
the greatest probabilities are assumed to be the most
relevant to the query. From here users are permitted to
scan through the initial list of documents. Each
document contains links to the most relevant related
documents. A query for “aircraft” would return any
article that has to do with the term.
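A sketch of this query step, assuming the dictionary, lda, and corpus objects from the LDA sketch above are in scope: the query is reduced to its topic mixture and compared against each stored article's mixture. Cosine similarity is used here as one reasonable choice; the exact ranking measure is an assumption, not something the system prescribes.

import numpy as np

def topic_vector(bow, lda, num_topics):
    # Dense topic-probability vector for one bag-of-words document.
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

# Break the query into its components and project it onto the topics.
query_bow = dictionary.doc2bow("power plants".lower().split())
query_vec = topic_vector(query_bow, lda, 3)

# Rank stored articles by cosine similarity of their topic mixtures to the query.
scores = []
for i, bow in enumerate(corpus):
    doc_vec = topic_vector(bow, lda, 3)
    sim = float(np.dot(query_vec, doc_vec) /
                (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec) + 1e-12))
    scores.append((sim, i))
print(sorted(scores, reverse=True))  # most relevant articles first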
X. CONCLUSION
Using natural language processing techniques and
latent Dirichlet allocation, we aimed to deliver a system
where users can search (query) for articles and, rather than
getting articles through a simple keyword search, are given
articles based on relevancy. This information could
potentially shed light on relations between entities
throughout time.
XI. ACKNOWLEDGEMENT
We wish to acknowledge the assistance and
support of Siemens AG; Dr. Matthew Evans, Director of
Smarter Services at Siemens; Akshay Patwal, Strategic
Business Manager at Siemens; and Pawel Wocjan,
Professor at the University of Central Florida.
XII. REFERENCES
[1] Marie-Catherine de Marneffe, Christopher D.
Manning, “Stanford typed dependencies manual”
[2] Diane J. Hu, “Latent Dirichlet Allocation for Text,
Images, and Music”
[3] Ryan McDonald, Fernando Pereira, “Non-projective
Dependency Parsing using Spanning Tree Algorithms”,