Big Data Analysis
Mario Massad, Matthew Toschi, Tyler Truong

Abstract – Every day, more articles of unstructured data populate the internet, where they are left untouched. Much of this data contains raw text, unlabeled and unclassified. Analyzing these enormous amounts of data can reveal trends and lead to new discoveries. This information, when fully realized, can be used to find unknown relationships between the entities it contains, such as people, businesses, or other groups. Taking these unstructured documents, we use natural language processing and named entity recognition to identify people, organizations, and locations, and to recognize the connections that link them together. By implementing latent Dirichlet allocation (LDA), the similarity between documents is discovered. Together, the relevance of documents and the similarities of the entities they share can shed light on connections previously undiscovered.

Keywords: Natural Language Processing, Clustering Methods, Sentiment Analysis, Named Entity Recognition

I. INTRODUCTION

Connecting unstructured data in a way that is meaningful is an ongoing problem in today’s world. When looking at unstructured data such as news articles, computers see this data as simply strings. Machines do not interpret words the way a human does. Humans recognize words and see objects such as people and organizations, and humans can make the connection between one sentence and the next. If a computer had the ability to see things in the same way, it could read hundreds of documents at a time and make connections between entities faster than any human could.

Taking this unstructured data, the first step is to decompose sentences into their significant components using several natural language processing techniques such as part-of-speech tagging, which is the process of marking up individual components such as nouns, verbs, and adjectives. To the system, the most relevant elements of a sentence are nouns and verbs. Nouns give information about the people, locations, and organizations that would be considered relevant to the article, and verbs provide the relations between nouns. In addition to part-of-speech tagging, we implemented dependency tree parsing, which allows us to observe the relation of each word in a sentence to another word. These natural language processing techniques are implemented using Stanford’s CoreNLP library. The extracted information is stored in a MongoDB database, which fits the unstructured format of our data.

Extracting these details from the database, the system implements latent Dirichlet allocation (LDA). Latent Dirichlet allocation allows sets of observations to be explained by unobserved groups [1]. LDA represents each article of unstructured data as a mixture of topics. It is an unsupervised, statistical approach for modeling text corpora by discovering latent semantic topics in large collections of text documents [2]. In addition to text, LDA can be applied to images and music; however, our research focuses on the natural language in our articles. The idea behind LDA is that documents that contain similar content will most likely contain similar words. Sets of words that often co-occur are referred to as topics. Documents that share a similar topic structure can be seen as relevant to each other. Document A could contain “aircraft” and “Siemens” as topics, and Document B could contain “aircraft” and “Boeing”. These two documents would have some relevance to each other; the word “aircraft” is what relates the two. With this information, one could discover connections between entities that might previously have gone unnoticed.

II. NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is the field of computer science involving the interaction between computers and human language. NLP algorithms can be used to extract meaningful information from documents. We found Stanford’s CoreNLP, a natural language processing library, to be the most viable tool for this project. In our research, we considered four natural language processing techniques, all provided by Stanford CoreNLP:

Part-of-speech tagging – the process of marking up every word in a document with its particular part of speech (noun, verb, adjective, etc.). Each sub-phrase in a document gets marked.

Dependency parsing – unlike part-of-speech tagging, dependency parsing breaks down a document based on word relations. Words that are related to each other are connected by direct links.

Named-entity recognition – a subtask of information extraction that involves locating and classifying elements in text into pre-defined categories such as persons, organizations, locations, etc.

Sentiment analysis – a branch of natural language processing and text analysis used to identify and extract subjective information in text. It determines the attitude of the text, “positive” or “negative”.

III. PART-OF-SPEECH TAGGING

Part-of-speech tagging, or grammatical tagging, is the process of marking up words in a text with their corresponding part of speech. These tags are based on each word’s definition as well as its context. The sentence “He set the food on the table.” gives us the following tagged sentence:

He/PRP set/VBD the/DT food/NN on/IN the/DT table/NN ./.

In this sentence, the word “set” is tagged as a verb; however, another example, “The set of numbers were sorted”, provides us with:

The/DT set/NN of/IN numbers/NNS were/VBD sorted/VBN ./.

Here, “set” is tagged as a noun. The eight main parts of speech in English are noun, verb, adjective, preposition, pronoun, adverb, conjunction, and interjection; however, there are many more part-of-speech tags that stem from these eight. When evaluating a sentence, we look to extract primarily the nouns and verbs, which help us focus on what is significant. Nouns tell us about things such as people and organizations, while verbs provide the connections between entities.
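As a concrete illustration of this tagging step, the following is a minimal sketch using NLTK’s off-the-shelf tagger as a stand-in for the Stanford CoreNLP tagger used in our system; the resource downloads and example sentences are only illustrative.

```python
# Minimal part-of-speech tagging sketch. NLTK is used here purely as a
# stand-in for the Stanford CoreNLP tagger described in this section.
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "He set the food on the table.",       # "set" should come back as a verb
    "The set of numbers were sorted.",     # "set" should come back as a noun
]

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)          # list of (word, Penn Treebank tag)
    print(tagged)

    # Keep only the nouns (NN*) and verbs (VB*), the elements this system
    # treats as most significant.
    content_words = [(w, t) for w, t in tagged if t.startswith(("NN", "VB"))]
    print(content_words)
```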
IV. DEPENDENCY PARSING

As mentioned earlier, dependency parsing involves breaking down text based on word relations. While part-of-speech tagging focuses on each word’s significance to a sentence, dependency parsing connects each word using direct links or weights. Each word depends on one parent, which is either another word or a root symbol.

Consider the example sentence from [3], “John hit the ball with the bat”. Each word in this sentence connects to one other word based on some relation. Stanford CoreNLP defines the sentence as:

nsubj(hit-2, John-1)
root(ROOT-0, hit-2)
det(ball-4, the-3)
dobj(hit-2, ball-4)
prep(hit-2, with-5)
det(bat-7, the-6)
pobj(with-5, bat-7)

The relation between the words “hit” and “John” is defined by nsubj (nominal subject). A nominal subject is a noun phrase which is the syntactic subject of a clause. Each of these relation definitions gives us insight into how nouns are related to each other.

Dependency trees are obtained by a maximum spanning tree algorithm [3]. Edges are placed between each pair of words in a sentence, and each edge is scored by the dot product between a high-dimensional feature representation of the edge and a weight vector [3]:

s(i, j) = w · f(i, j)

The score of a dependency tree y for sentence x is then [3]:

s(x, y) = Σ_{(i,j) ∈ y} s(i, j) = Σ_{(i,j) ∈ y} w · f(i, j)
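To make the relation triples above concrete, here is a minimal sketch using spaCy as a stand-in for the Stanford CoreNLP dependency parser; spaCy’s relation labels may differ slightly from the ones shown above, and the small English model must be installed separately.

```python
# Minimal dependency-parsing sketch. spaCy stands in for Stanford CoreNLP
# here; its relation label set may differ slightly from CoreNLP's, so treat
# the output as illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")   # small English model, installed separately
doc = nlp("John hit the ball with the bat.")

# Print relation(head, dependent) triples, mirroring the CoreNLP output above.
for token in doc:
    print(f"{token.dep_}({token.head.text}, {token.text})")

# Pull out just the subject/object links, which tie the nouns of the
# sentence to the verb that connects them.
core_links = [(t.dep_, t.head.text, t.text)
              for t in doc if t.dep_ in ("nsubj", "dobj", "pobj")]
print(core_links)
```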
V. NAMED-ENTITY RECOGNITION

Considered a subtask of information extraction, named-entity recognition (NER) involves locating and classifying elements in a text into pre-defined categories. Using Stanford CoreNLP, we attempt to locate the following categories: persons, organizations, locations, money, percentages, times, and dates. These seven categories assist in discovering the topics of each document. NER systems commonly use linguistic grammar-based models as well as statistical machine-learning models. Popular libraries such as Stanford CoreNLP and OpenNLP take a statistical approach to named-entity recognition. Statistical models require large amounts of hand-annotated data; however, some use semi-supervised approaches to get around this. Each time we classify a new sentence fragment, it cycles back into our data, where the classifier is retrained and used to achieve better results.

VI. SENTIMENT ANALYSIS

Sentiment analysis attempts to extract subjective information from text. This stage of natural language processing does not provide any new information about the text, such as entities and relations; however, it gives insight into the “vibe” of the text, such as positive or negative. For instance, when examining Twitter statuses, one can usually classify tweets as “positive” or “negative”. The sentence “Company A claimed bankruptcy” would fall under “negative”, while “Company A sold for 2 million dollars” would be classified as “positive” sentiment. Sentiment analysis conducted over a period of time can reveal meaningful information about a company or entity, for example, whether it is doing well or seems to be going under.

However, given how news articles are typically written, sentiment analysis proves to be more difficult. When doing sentiment analysis on tweets, it is often clear whether a tweet is “positive” or “negative”; a tweet can be as positive as “My Team is Winning!!!” or as negative as “THIS album SUCKS!”. News articles, on the other hand, are often written in an unbiased way, and conducting sentiment analysis on them can prove to be a challenge. To gain better results, we turned to supervised training of our classifier: we took fragments of sentences from our texts and hand-classified each fragment on a scale from “very negative” to “very positive”.

Example of classifying fragments

Sentiment analysis pipeline

VII. STORAGE AND PROCESSING

As the amount of data in the world has grown tremendously, so has the debate of SQL vs. NoSQL databases. SQL is the standard language for relational database management systems, while NoSQL databases use a key-value structure with none of the predetermined rows and columns seen in SQL databases. One of the most popular NoSQL databases is MongoDB. MongoDB carries no strict rules on data relations, which makes it the best fit for the unstructured format of our data. Articles that are imported into the system are stored using MongoDB GridFS, from which they are exported to be processed and broken down using Stanford CoreNLP’s natural language processing techniques. The processed information is stored as shown below.

MongoDB document

VIII. LATENT DIRICHLET ALLOCATION

As we break down and process each unstructured news article, we require a technique that will allow us to pair relevant articles together. Clustering algorithms involve grouping similar elements together; here we look at latent Dirichlet allocation. LDA is an unsupervised, statistical approach used for modeling text to find latent semantic topics in text documents. It is assumed that documents are roughly similar if they contain similar words. Words are treated as random variables, and we can compute the probability distribution over the latent variables conditioned on the observed variables.

Topic 1              Topic 2               Topic 3
term      weight     term       weight     term      weight
game      0.014      oil        0.021      food      0.021
team      0.011      Siemens    0.006      animal    0.015
soccer    0.009      electric   0.006      dog       0.013
play      0.008      power      0.005      healthy   0.0012

Example LDA table – 3 topics

In the table above, three topics appear, and each topic has a group of words that represent it. The terms in topic 1 seem to have something to do with soccer games, so articles that contain these words in their text might fall under topic 1. In addition, one document might fall under more than one topic; document X could be 40% topic 1 and 60% topic 2. These probabilities help us find which articles are relevant to a user’s query or to another article.
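As a minimal sketch of this topic-modeling step, the following uses gensim on a small, invented toy corpus of already-tokenized articles; the library choice, token lists, and parameter values are illustrative assumptions rather than the system’s actual configuration.

```python
# Minimal LDA sketch using gensim on a toy corpus. In the real system the
# token lists would come out of the CoreNLP stage and the MongoDB store.
from gensim import corpora, models

documents = [
    ["game", "team", "soccer", "play", "goal"],
    ["oil", "siemens", "electric", "power", "plant"],
    ["food", "animal", "dog", "healthy", "diet"],
    ["team", "play", "game", "coach", "soccer"],
]

dictionary = corpora.Dictionary(documents)               # token -> integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=3, id2word=dictionary,
                      passes=20, random_state=1)

# Each topic is a weighted list of terms, analogous to the table above.
for topic_id, terms in lda.print_topics(num_words=4):
    print(topic_id, terms)

# Each document is a mixture of topics, e.g. 40% topic 1 and 60% topic 2.
for bow in corpus:
    print(lda.get_document_topics(bow))
```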
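The next section describes matching user queries against these per-article topic mixtures. One possible ranking step is sketched below in plain Python, assuming each stored article already has a topic distribution from the LDA stage; the article names, topic numbers, probabilities, and the overlap score used here are all invented for illustration.

```python
# Sketch of ranking stored articles against a user query by comparing topic
# distributions. All names and numbers below are invented placeholders.

def overlap(p, q):
    """Simple overlap score between two topic distributions, given as
    dicts of topic id -> probability. Higher means more similar."""
    return sum(min(p[t], q[t]) for t in set(p) & set(q))

# Topic mixtures the LDA stage might have assigned to stored articles.
article_topics = {
    "article_17": {0: 0.82, 2: 0.18},
    "article_42": {1: 0.95, 0: 0.05},
    "article_03": {0: 0.40, 1: 0.60},
}

# Topic mixture inferred for the user query, e.g. "aircraft".
query_topics = {0: 0.90, 2: 0.10}

ranked = sorted(article_topics.items(),
                key=lambda item: overlap(query_topics, item[1]),
                reverse=True)

for name, topics in ranked:
    print(name, topics)   # most relevant articles first
```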
IX. QUERY AND USER INTERFACE

With all of this information available, we look to allow users to access these articles. A user’s query may be a simple collection of words that the user wants to know more about, for example “aircraft” or “power plants”. Taking the user’s query, we break it down into its individual components and run these against our LDA table. The articles that match with the greatest probabilities are assumed to be the most relevant to the query. From here, users are permitted to scan through the initial list of documents, and each document contains links to its most relevant related documents. A query for “aircraft” would return any article that has to do with the term.

X. CONCLUSION

Using natural language processing techniques and latent Dirichlet allocation, we aimed to build a system where users can search (query) for articles and, rather than getting articles through a simple keyword search, are given articles based on relevancy. This information could potentially shed light on relations between entities throughout time.

XI. ACKNOWLEDGEMENT

We wish to acknowledge the assistance and support of Siemens AG; Dr. Matthew Evans, Director of Smarter Services at Siemens; Akshay Patwal, Strategic Business Manager at Siemens; and Pawel Wocjan, Professor at the University of Central Florida.

XII. REFERENCES

[1] Marie-Catherine de Marneffe and Christopher D. Manning, “Stanford typed dependencies manual”.
[2] Diane J. Hu, “Latent Dirichlet Allocation for Text, Images, and Music”.
[3] Ryan McDonald and Fernando Pereira, “Non-projective Dependency Parsing using Spanning Tree Algorithms”.